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Load  balancing  is  a  powerful  technique  to  achieve  high  performance  in  parallel  and 

distributed  computer  systems.  There  are  tasks  in  which  the  computational  structure 

changes  dynamically  during  their  execution.  In  such  cases,  dynamic  load  balancing 

schemes  provide  better  opportunity  to  obtain  high  performance  through  a  dynamic 

reassignment  of  workload  to  the  processors.  A  case  in  point  is  the  class  of  numerical 

algorithms  based  on  adaptive  mesh  refinement  procedures  to  solve  partial  differential 

equations.  In  this  dissertation,  we  develop  efficient  dynamic  load  balancing  techniques 

for  adaptive  mesh  refinement  algorithms.  We  consider  three  different  approaches 

for  assigning  computations  of  dynamically  generated  grids  to  the  processors.  In 

assigning  these  computations,  in  addition  to  minimizing  workload  imbalance,  we  also 

impose  an  explicit  constraint  on  the  inter-processor  communication  cost.  We  propose 

three  different  approaches  for  solving  this  dynamic  load  balancing  problem.  The 

three  approaches  differ  in  granularity  of  grid  computations  assigned  to  the  processors 
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and  the  trade-offs  between  inter-grid  and  intra-grid  communication  costs,  that  are 
imposed  in  assigning  the  computations  to  distributed  memory  parallel  processors. 
For  each  approach,  we  develop  efficient  approximation  algorithms  which  attempt 
to  minimize  both  computational  and  communicational  costs.  We  propose  efficient 
implementation  of  these  algorithms  for  architectures  based  on  mesh  and  hypercube 
topologies.  We  demonstrate  through  experimental  results  how  our  approximation 
algorithms  can  provide  good  solutions  in  practice 
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CHAPTER  1 
INTRODUCTION 


Parallel  processing  has  attracted  a  lot  of  attention  as  an  approach  to  solve 
computationally  intensive  problems.  These  problems  arise  in  many  different  fields  of 
science  and  engineering,  and  their  computational  requirement  may  be  many  hundreds 
of  million  floating  operations  per  second.  Examples  of  such  applications  include 
aerodynamics,  nuclear  physics  and  weather  forecasting.  With  the  advent  of  today's 
VLSI  technology,  building  massively  parallel  systems  with  thousands  of  processors 
is  becoming  a  practical  and  commercial  reality.  These  machines  can  provide  the 
computational  power  needed  to  solve  the  kinds  of  problems  that  arise  in  the  above 
applications. 

Even  though  multiprocessor  systems  have  such  an  enormous  raw  computing  power, 
they  can  be  utilized  only  when  the  problems  are  efficiently  parallelized.  To  keep  all 
the  processors  busy,  we  have  to  partition  the  problem  into  many  components  that 
can  be  executed  in  parallel  by  the  processors  in  the  systems.  This  is  a  very  difficult 
task  since  some  problems  are  inherently  sequential,  so  that  they  cannot  be  ecisily 
parallelized.  Even  in  those  problems  that  have  many  parallelizable  computational 
components,  distributing  the  computational  workload  evenly  among  the  processors 
is  very  important  to  fully  take  advantage  of  the  computing  power  of  multiprocessor 
systems.  The  term  "load  balancing"  refers  to  the  activity  of  distributing  or  redis- 
tributing the  workload  among  the  processors  to  achieve  high  performance. 

Load  balancing  in  distributed  systems  has  been  studied  extensively  by  many  re- 
searchers [12,  14,  23,  31].  In  distributed  systems,  load  balancing  refers  to  the  op- 
eration of  properly  distributing  the  user  jobs  (and  sometimes  resources)  among  the 
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difFerent  nodes  so  that  the  utilization  of  computer  resources  and  job  throughput  are 
maximized.  Load  balancing  in  distributed  systems  shares  many  common  character- 
istics with  that  in  multiprocessor  systems.  Both  activities  have  the  common  goals 
of  achieving  high  performance  through  proper  utilization  of  computational  resources. 
However,  there  are  a  few  important  differences  in  the  above  two  load  balancing  ac- 
tivities. 

In  multiprocessor  systems,  load  balancing  is  performed  to  distribute  highly  related 
subtasks  among  the  processors,  while  in  distributed  systems  user  jobs  are  relatively 
independent  of  each  other  and  do  not  need  to  communicate  with  each  other.  This 
makes  it  important  to  reduce  communication  cost  by  assigning  related  tasks  to  pro- 
cessors located  close  to  each  other.  Thus,  in  multiprocessor  systems,  load  balancing 
operation  has  to  minimize  interprocessor  communication  in  addition  to  balancing 
the  workload  among  the  processors.  In  multiprocessor  systems,  load  balancing  can 
be  limited  to  each  application  program  while  in  distributed  systems,  load  balancing 
activity  has  to  consider  all  the  user  jobs.  For  this  reason,  load  balancing  in  parallel 
computers  can  provide  better  opportunity  for  high  performance  than  in  distributed 
systems.  For  the  same  reason,  cost  of  performing  load  balancing  introduces  higher 
proportional  overhead  to  an  application  program  in  a  parallel  computer  than  in  dis- 
tributed systems.  Hence,  it  is  important  to  minimize  this  load  balancing  overhead  in 
multiprocessor  systems. 

A  load  balancing  algorithm  in  multiprocessor  systems  must  be  designed  so  as  to 
mciintain  a  proper  trade-off  between  performance  improvement  and  load  balancing 
cost;  these  algorithms  can  be  tailored  to  each  application.  The  load  balancing  tech- 
niques used  in  distributed  systems,  on  the  other  hand,  are  difficult  to  tailor  to  a 
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particular  application.  In  this  dissertation,  we  focus  on  load  balancing  in  multipro- 
cessor systems  with  mesh  and  hypercube  interconnection  networks,  which  are  widely 
used  interconnection  networks  in  many  practical  architectures. 

1.1    Load  Balancing  Schemes 

There  are  many  different  load  balancing  schemes  reported  in  the  literature.  In 
general,  they  can  be  classified  as  (a)  centralized  and  decentralized  and  (b)  static 
and  dynamic.  In  centralized  load  balancing  schemes,  load  balancing  decisions  are 
made  by  a  central  site  or  a  central  processor.  When  all  processors  are  identical,  a 
processor  is  designated  as  a  central  processor.  This  processor  collects  from  the  other 
processors  all  the  status  information  (such  as  workload  distribution)  necessary  to 
make  proper  load  balancing  decisions.  This  information  can  be  represented  as  a  task 
graph  which  indicates  not  only  the  computational  load  for  each  computational  module 
(represented  as  a  node)  but  also  the  communication  that  is  needed  between  these 
modules.  Then  the  load  balancing  algorithm  is  employed  to  decide  the  assignment 
or  reassignment  of  processes  to  processors,  and  the  results  of  these  decisions  are 
broadcast  to  all  the  other  processors. 

In  decentralized  load  balancing  schemes,  each  processor  has  to  make  its  own  deci- 
sions about  load  balancing  after  collecting  the  necessary  status  information  from  only 
a  subset  of  all  the  processors;  this  subset  is  usually  restricted  to  neighboring  proces- 
sors or  processors  located  in  a  small  neighborhood.  Since  all  the  processors  participate 
in  making  the  load  balcincing  decisions,  decentralized  schemes  give  quicker  solutions 
compared  to  centralized  schemes.  It  also  takes  less  time  to  collect  the  information 
from  the  subset  of  processors,  and  it  is  not  necessary  to  broadcast  the  results  of 
load  balancing  decisions.  Centralized  schemes  however,  have  the  advantage  of  mak- 
ing more  accurate  decisions  over  decentralized  schemes,  since  in  these  schemes  more 
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accurate  information  is  available  to  make  global  load  balancing  decisions.  In  decen- 
tralized schemes,  workload  is  guaranteed  to  be  balanced  only  in  small  neighborhoods 
due  to  the  lack  of  the  global  information.  This  is  especially  true  if  the  workload 
changes  very  often  and  the  change  is  not  gradual. 

In  static  load  balancing  schemes,  workload  distribution  is  determined  before  the 
tasks  begin  their  execution.  Load  balancing  primarily  involves  assigning  nodes  of  the 
task  graph  to  processors,  and  it  will  not  change  until  all  the  tasks  have  been  executed. 
In  dynamic  load  balancing  schemes,  the  workload  distribution  (task  assignment)  is 
allowed  to  change  during  the  course  of  program  execution.  Static  load  balancing 
schemes  are  suitable  for  these  algorithms  (matrix  multiplication,  FFT  etc.)  where 
the  computational  workload  or  workload  distribution  among  the  processors  does  not 
change  during  the  execution  of  a  parallel  program.  But  there  are  algorithms  (such  as 
numerical  solutions  of  partial  differential  equations  using  adaptive  meshes)  in  which 
the  computational  structure  changes  dynamically  during  their  executions.  If  the  task 
assignment  does  not  change  in  response  to  this  change,  it  will  result  in  degradation  of 
performance  due  to  changes  in  communication  requirements  and  imbalance  in  work- 
locid  among  the  processors.  For  these  algorithms,  dynamic  load  balancing  schemes 
provide  a  better  opportunity  to  obtain  high  performance  through  a  dynamic  reas- 
signment of  workload  in  response  to  the  changes  in  the  computational  structure  of 
the  program. 

1.2    Adaptive  Mesh  Refinement 

An  important  engineering  application  that  will  benefit  most  from  our  load  bal- 
ancing research  is  the  solution  of  partial  differential  equations  using  adaptive  meshes 
[1,  3,  8].  For  this  reason,  we  explain  in  this  section  the  need  for  dynamic  load  bal- 
ancing in  adaptive  mesh-based  algorithms.  The  formulation  of  many  physical  and 
engineering  problems  involves  solution  to  a  set  of  partial  differential  equations  in 
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many  physical  quantities  that  vary  in  time  and  space.  Since  analytical  solutions  to 
such  equations  are  difficult  to  obtain,  numerical  approximations  based  on  iterative 
procedures  are  obtained  using  a  finite  computational  domain  that  is  embedded  within 
the  physical  domain  with  appropriate  boundary  conditions.  Specifically,  a  grid  struc- 
ture (usually  rectangular)  is  imposed  on  the  computational  domain,  and  solutions 
are  evaluated  at  the  grid  points.  Initially,  this  grid  is  spaced  at  regular  intervals  and 
computations  are  performed  at  grid  points.  We  may  want  to  partition  the  computa- 
tional domain  in  such  a  way  that  each  processor  is  assigned  an  equal  number  of  grid 
points. 

Though  the  solutions  of  the  partial  differential  equations  are  smooth  in  most 
regions  of  their  domains,  the  boundary  layers  or  isolated  regions  with  steep  gradients 
or  shocks  may  often  introduce  unacceptable  errors  in  the  solutions  if  a  coarse  grid  is 
employed  to  cover  the  entire  domain.  Hence  the  grid  or  mesh  size  must  be  related 
to  the  required  error  tolerance  in  the  approximation.  Yet  at  the  same  time,  use  of 
a  uniform  fine  grid  covering  the  entire  region  will  result  in  excessive  computations. 
For  this  reason,  adaptive  mesh  refinement  is  employed  during  the  algorithm  to  place 
finer  subgrids  at  locations  that  have  large  estimated  error  values.  In  solving  a  time- 
dependent  problem,  the  regions  of  steep  error  change  with  time  and  hence  also  the 
need  for  adaptive  mesh  refinement.  The  most  popular  and  simplest  approach  [1,  3] 
to  adaptive  mesh  refinement  uses  a  weighting  function  W  that  is  proportional  to  a 
measure  of  the  error  is  calculated  at  all  grid  points.  The  mesh  size  is  then  adjusted 
so  that  W  multiplied  by  As  is  nearly  the  same  at  all  the  points  in  the  domain. 

The  adaptive  mesh  refinement  techniques  illustrate  the  type  of  challenge  faced  by 
dynamic  load  balancing.  The  computational  load  is  initially  uniform  over  the  entire 
region  but  as  finer  subgrids  are  defined,  the  load  distribution  changes.  This  may 
result  in  a  drastic  change  in  allocation  of  computations  to  the  processors.  It  may 
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also  cause  some  processors  to  run  out  of  memory  space  when  newer  fine  grids  are 
introduced  in  some  regions.  An  additional  difficulty  is  encountered  in  solutions  of 
time-dependent  problems.  Here  the  step  sizes  for  time  are  related  to  mesh  size,  so 
that  coarse  grid  and  fine  grid  computations  must  be  synchronized  in  some  fashion. 
This  may  result  in  processor  idle  time  unless  load  allocation  to  processors  changes 
during  the  transition  from  the  completion  of  a  fine  grid  step  to  the  beginning  of 
a  coarse  grid  step.  Furthermore,  computations  related  to  generation  of  finer  grids 
and  interaction  between  component  grids  (required  for  interpolation  of  coarse  grid 
values  to  obtain  boundary  values  for  finer  grids  or  updating  coarse  grid  values  using 
component  grid  values)  must  themselves  be  parallelized  and  workload  due  to  these 
computations  must  be  taken  into  account  by  the  load  balancing  algorithm, 

1.3    Dissertation  Contribution 

Even  though  load  balancing  is  a  very  powerful  technique  to  achieve  high  perfor- 
mance in  multiprocessor  systems,  finding  a  proper  distribution  of  workload  is  often 
a  computationally  intractable  problem.  Hence  exhaustive  enumeration  or  use  of  a 
heuristic  search  technique  such  as  branch-and-bound  is  required  to  obtain  the  optimal 
solutions.  But  these  algorithms  are  very  expensive  to  run  in  real  time  because  the 
computation  time  increases  exponentially  with  the  size  of  the  problem.  For  this  rea- 
son, we  seek  an  approximate  solution  that  is  reasonably  close  to  an  optimal  solution 
and  which  can  be  obtained  quickly.  Finding  such  an  approximate  solution  is  a  chal- 
lenging task  in  a  situation  where  computational  and  communicational  requirements 
change  dynamically,  as  in  adaptive  mesh  refinement.  The  following  issues  need  to  be 
addressed  in  solving  the  dynamic  load  balancing  problem. 

1.  How  do  we  partition  the  computational  domain  at  the  beginning  of  the  program 
execution  ?  This  problem  is  trivial  in  the  case  of  uniform  meshes,  and  for  non- 
uniform meshes,  techniques  have  been  proposed  in  the  literature  e.g.  [8,  28]. 
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2.  When  do  we  have  to  perform  remapping  of  computations  ?  In  general,  for 
situations  where  the  workload  distribution  changes  continuously,  we  need  a 
mechanism  for  deciding  when  to  perform  remapping  based  on  rate  of  change 
of  workload,  ratio  of  computation  to  communication  (also  called  average  grain 
size)  and  cost  of  load  balancing.  In  the  case  of  adaptive  mesh  refinement, 
a  logical  choice  is  to  rebalance  the  workload  among  the  processors  whenever 
we  modify  the  existing  grid  structure.  Thus,  dynamic  load  balancing  can  be 
triggered  by  user  programs  in  this  case. 

3.  How  do  we  remap  computations  to  the  processors?  Answering  this  question 
will  be  the  major  focus  of  this  dissertation.  Our  goal  is  to  design  an  efficient 
parallel  load  balancing  algorithm  that  will  require  many,  if  not  all,  processors 
to  participate  in  the  load  balancing  process.  Then  we  interleave  load  balancing 
activity  between  computations  in  the  execution  of  the  program.  The  informa- 
tion used  in  the  load  balancing  algorithm  is  somewhat  global  in  contrast  to 
decentralized  load  balancing  strategies. 

In  this  dissertation,  we  focus  on  developing  dynamic  load  balancing  methods  that 
are  especially  applicable  for  adaptive  mesh  refinement.  Specifically,  we  propose  three 
different  dynamic  load  balancing  approaches  for  the  adaptive  mesh  refinement  al- 
gorithm used  for  solving  non-time  dependent  problems.  These  approaches  differ  in 
granularity  of  computation.  The  first  approach  eliminates  the  need  for  intra-grid 
communication  to  obtain  neighboring  grid  point  values  within  a  grid,  resulting  in 
coarse  grain  computation.  But  we  still  need  inter-grid  communication  to  send  the 
boundary  values  and  to  project  fine  grid  solutions  on  coarse  grids.  In  the  second  and 
third  approaches,  we  need  both  intra-grid  communication  and  inter- grid  communi- 
cation, resulting  in  finer  grain  computation.  But  they  are  more  flexible  approaches 
thein  the  first  one. 
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There  are  many  load  balancing  algorithms  which  merely  balance  the  workload 
among  the  processors  while  ignoring  the  communication  overhead  between  processors. 
Our  load  balancing  algorithms  will  consider  both  computational  and  communicational 
costs  in  rebalancing  the  workload.  The  communication  cost  will  be  treated  as  an  ex- 
plicit constraint,  or  the  combined  costs  of  both  computation  and  communication  will 
be  used  as  an  objective  function  in  our  load  balancing  algorithms.  Since  finding  an 
optimal  solution  for  the  load  balancing  problem  is  computationally  intractable,  we 
propose  heuristic-based  load  balancing  strategies.  For  the  first  and  third  approaches, 
we  formulated  the  problems  and  developed  efficient  heuristic  algorithms.  Our  algo- 
rithms can  be  efficiently  parallelized  so  that  all  or  a  subset  of  processors  participate  in 
making  load  balancing  decisions  using  global  information.  For  the  second  approach, 
we  also  formulated  the  problem  and  show  that  it  can  be  reduced  to  a  load  balancing 
problem  with  resource  migration  in  distributed  systems  [31].  The  performance  of 
our  algorithms  are  tested  through  simulation  because  analytic  evciluation  is  difficult 
to  do.  In  presenting  our  dynamic  load  balancing  algorithms,  we  will  use  distributed 
memory  models  of  parallel  computations  with  hypercube  and  mesh  topologies. 

The  orgcinization  of  the  dissertation  is  as  follows:  In  the  next  chapter,  we  survey 
several  approaches  used  for  the  load  balancing  in  various  applications.  In  chapter 
3,  we  explain  the  first  approach  and  describe  the  formulation  of  the  problem  and 
heuristic  algorithms  in  detail.  In  addition  to  analyzing  the  complexities  of  these  al- 
gorithms, we  also  present  the  results  of  experiments  showing  the  performance  of  our 
algorithms.  In  chapter  4,  we  describe  the  second  approax:h  and  show  the  similarities 
to  the  load  balancing  problem  formulated  in  [31].  We  do  not  pursue  this  approach  in 
detail  in  this  dissertation.  In  chapter  5,  we  describe  the  formulation  of  the  problem 
based  on  the  third  approach  and  show  how  it  can  be  reduced  to  a  variation  of  the  well 
known  two-dimensional  bin  packing  problem.  We  also  present  an  eflScient  heuristic 
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algorithm  to  solve  the  two-dimensional  bin  packing  problem  and  present  experimen- 
tal results  for  the  performance  of  the  algorithm.  Finally,  we  give  the  summary  of 
dissertation  and  discuss  scope  of  future  work  in  chapter  6. 


CHAPTER  2 
SURVEY  OF  RELATED  WORK 

In  this  chapter,  we  survey  the  work  in  the  area  of  dynamic  load  balanc- 
ing in  multiprocessor  systems.  In  the  first  section,  we  present  a  few  adaptive  mesh 
generation  methods.  They  share  the  common  sequence  of  operations  in  solving  par- 
tial differential  equations  such  as  solving  a  set  of  linear  equations,  error  estimation, 
clustering  and  grid  generation.  But  they  differ  in  many  respects,  such  as  the  finite 
difference  scheme  used,  error  estimation  method,  the  form  of  mesh,  partial  differen- 
tial equations  solved,  and  so  on.  In  the  second  section,  we  present  a  computational 
model  called  WaTor.  This  model  has  been  used  as  a  paradigm  in  the  evaluation  of 
dynamic  load  balancing  strategies  proposed  by  many  researchers.  We  also  present 
some  of  the  most  frequently  applied  load  balancing  methods.  These  methods  include 

•  binajy  decomposition 

•  scatter  decomposition 

•  diffusion  scheme 

We  also  discuss  a  few  other  related  methods,  and  finally  we  consider  the  applicability 
of  those  methods  to  adaptive  mesh  refinement. 

2.1    Adaptive  Mesh  Refinement 

Adaptive  mesh  refinement  is  a  numerical  method  for  solving  some  partial  differen- 
tial equations  where  the  solution  is  not  smooth  in  some  regions  of  the  domain.  In  this 
method,  a  coarse  grid  structure  is  imposed  initially  on  the  computational  domain, 

and  solutions  are  evaluated  at  the  grid  points.  Then  the  error  rates  are  estimated  at 
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the  grid  points  and  those  with  high  estimated  errors  are  clustered.  The  fine  grids  (i.e. 
grids  with  smaller  spacing)are  imposed  on  these  clustered  regions,  and  the  same  pro- 
cedure is  applied  recursively  to  these  regions.  Since  fine  grids  are  used  only  in  small 
regions  of  the  computational  domain  where  the  solution  is  not  smooth  (e.g.  sharp 
gradients,  shocks),  the  amount  of  computation  is  substantially  reduced  compared  to 
using  a  fine  grid  for  the  entire  domain.  The  saving  of  computation  is  even  greater 
when  the  time  step  size  is  related  to  the  mesh  size  in  solving  time  dependent  prob- 
lems. In  this  section,  we  survey  three  different  adaptive  mesh  refinement  methods. 
The  first  and  third  methods  are  used  for  solving  non-time  dependent  problems  while 
the  second  method  is  used  for  solving  time  dependent  problems. 

Altas  and  Stephenson  introduced  a  two-dimensional  adaptive  mesh  generation 
method  [3].  They  demonstrated  the  application  of  their  method  to  solving  linear  and 
nonlinear  partial  differential  equations.  In  their  method,  nonuniform  mesh  is  gener- 
ated using  a  crude,  inexpensive  solution  first,  then  the  equation  is  solved  using  the 
adaptive  grid  generated.  Error  is  estimated  in  each  subregion  instead  of  at  each  mesh 
point,  and  the  regions  in  which  the  estimated  errors  are  beyond  a  given  tolerance, 
are  subdivided  into  four  equal  subregions.  Because  of  the  nonuniformity  of  the  mesh, 
clcissical  difference  schemes  such  as  central  difference  or  forward  difference  scheme 
cannot  be  directly  applied.  Seven  different  difference  formulae  were  used,  depending 
on  the  types  of  mesh  points. 

Berger  and  Oliger  used  multiple  component  grids  for  the  solution  of  hyperbolic 
partial  differential  equations  [8].  They  flagged  the  grid  points  where  error  estimates 
are  larger  than  a  given  tolerance.  The  flagged  points  are  clustered  and  a  fine  grid 
covering  those  points  is  defined.  A  fine  grid  is  completely  inside  a  coarse  grid.  The 
same  procedure  is  recursively  repeated,  so  that  a  grid  at  level  /  is  contained  by  a  grid 
at  level  /  —  1.  The  mesh  ratio  (/1///1/+1  where  hi  and         are  the  mesh  widths  at 


12 


level  /  and  /  +  1  respectively)  of  any  two  successive  grids  is  assumed  to  be  a  constant 
integer  to  make  it  easy  to  determine  the  order  of  grid  integration.  The  same  factor 
is  used  to  set  the  time  step  on  the  grid  at  a  certain  level.  Thus  the  smallest  time 
step  is  applied  on  only  the  finest  grid  and  not  the  entire  domain.  In  each  refinement 
level,  rectangular  grids  are  generated,  which  cover  the  flagged  points.  Since  some 
unflagged  points  are  also  included  in  the  rectangles  that  define  new  grids,  efficiency 
of  computation  depends  on  how  clustering  is  performed  for  subgrid  generation.  The 
mesh  refinement  algorithm  is  succinctly  stated  as  follows 


1.  Recursive  Procedure  Integrate(/) 

2.  Repeat  {meshjratiof  times 

3.  Estimate  errors  for  all  grids  at  level  /  and  above 

4.  Flag  points  with  high  errors 

5.  Cluster  the  points  and  define  grids  at  level  /  +  1 

6.  Step  Aff  on  all  grids  at  level  /  (*  Integrate  over  time  *) 

7.  If  (level  /  +  1  exists) 

8.  ThenBegin 

9.  Integrate(/  +  1) 

10.  Update(/,/+ 1) 

End 

End 

11.  /  =  O(*coarsest  grid  level*) 

12.  Integrate(/) 


At  steps  3,  4  and  5,  error  estimation  is  performed  on  each  mesh  point  of  the  current 
grid  and  flagged  points  are  clustered  to  determine  the  location  and  orientation  of 
the  fine  grid  that  includes  the  flagged  points.  Since  the  new  grids  are  introduced 
only  in  some  regions,  this  step  causes  a  dynamic  change  in  the  computational  work 
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associated  with  each  unit  area  of  the  domain.  At  step  10,  communication  between 
meshes  at  level  /  and  /  + 1  is  performed  to  update  the  coarse  grid  values  by  projecting 
the  fine  grid  solution  values  onto  the  coarse  grid  points.  Time  is  refined  as  well  as 
space,  so  large  time  steps  are  taken  on  coarse  grids  and  small  time  steps  on  fine  grids. 
At  every  level  /  and  for  every  time  step  Ati,  computations  of  all  the  grids  at  level  / 
and  above  (finer  grids)  are  synchronized. 

Acharya  and  Moukalled  [1]  used  a  local  mesh  refinement  approach  for  solving 
convection-diffusion  problems.  When  a  local  mesh  refinement  method  is  used,  por- 
tions of  the  domain  with  high  error  estimation  are  flagged  and  clustered,  and  refine- 
ment is  performed  only  on  these  areas  while  other  areas  are  not  disturbed.  Unlike  the 
previous  method,  this  does  not  require  the  refined  regions  to  be  rectangles.  They  use 
a  different  error  estimation  formula  than  the  previous  methods,  but  the  mesh  points 
with  high  error  rates  are  flagged  and  clustered  as  in  the  previous  method.  Generating 
the  next  level  grid  involves  a  coordinate  transformation.  This  is  done  by  solving  a 
system  of  Poisson  equations.  An  ordinary  difference  scheme  is  used  to  discretize  the 
refined  region,  and  the  resulting  system  of  algebraic  equations  are  solved  iteratively. 
The  refinement  continues  until  the  desired  level  of  accuracy  is  obtained  as  in  the 
previous  approach.  The  generalized  procedure  is  described  by  the  following  steps  : 

1.  Define  a  preliminary  coarse  grid  (flo)  in  curvilinear  coordinates  in  the  domain, 
and  obtain  a  solution  Vo  on  it. 

2.  For  the  first  level  of  refinement  {i  =  1),  flag  grid  points  if  normalized  weighting 
function  values  exceed  a  specified  limit.  For  higher  levels  of  refinement  (i),  flag 
grid  points  with  normalized  weighting  function  values  higher  than  those  in  the 
previous  level  of  refinement  (i  —  1).  Thus  flagged  regions  (f2,)  are  generated  at 
any  level  of  refinement,  and  typically,  f2,  is  embedded  in  fi.-i. 

3.  Generate  a  finer  mesh  in  each  flagged  region  of  fl,-. 
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4.  Interpolate  boundary  conditions  along  the  boundaxies  of  each  flagged  region 
using  the  available  solution  of  the  previous  mesh  (K-i)- 

5.  Obtain  the  solution  Vi  in  each  flagged  region  fi,-. 

6.  Calculate  correction  terms  in  Qi  for  the  equations  on  the  previous  mesh  (fi.-i). 
Solve  the  new  set  of  equations  in  fi.-i  to  obtain  an  improved  (i  -  1)  level 
solution,  Vi-i. 

7.  Repeat  steps  4-6  until  a  specified  level  of  convergence  is  obtained. 

8.  Advance  to  the  next  level  of  refinement  and  repeat  steps  2-7. 

9.  Proceed  until  the  solution  at  the  finest  desired  level  of  refinement  Vk  is  obtained. 

The  computation  and  communication  phases  are  interleaved  as  in  the  previous 
mesh  refinement  algorithm.  In  steps  4  and  6,  grid-to-grid  communication  is  performed 
for  both  interpolation  and  projection.  In  steps  2,  3  and  5,  weighting  functions  are 
computed  to  measure  error  rate  at  each  mesh  point  and  the  solution  is  obtained  for 
the  refined  region,  so  they  correspond  to  computation  phases. 

2.2    Various  Load  Balancing  Approaches  for  Parallel  Computation 

In  this  section,  we  present  many  different  load  balancing  methods  found  in  the 
literature.  First  we  present  a  computational  model  "WaTor,"  which  is  a  popular 
paradigm  used  in  describing  situations  that  involve  dynamically  changing  workload. 
We  use  this  model  to  explain  some  of  the  load  balancing  methods  that  we  describe 
in  the  subsequent  sections. 

2.2.1    WaTor  (Watery  planet  with  Torodial  topologv) 

WaTor  is  an  algorithm  which  simulates  fish  and  sharks  living,  moving,  breeding, 
and  eating  in  a  two-dimensional  ocean.  It  is  an  example  of  Monte  Carlo  particle 
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dynamics  simulation  where  the  behavior  of  data  is  unpredictable.  It  was  first  intro- 
duced by  Dewdney  [22]  and  later  used  to  illustrate  the  load  balancing  problem  in 
multiprocessor  systems  [25,  32].  The  problem  domain  of  WaTor  is  a  two-dimensional 
ocean  with  wrapping  borders.  The  two  dominant  denizens  in  the  ocean  are  sharks 
and  fish.  They  constantly  move  in  the  ocean  according  to  the  kind  and  density  of  fish 
in  their  neighborhood.  The  fish  give  birth  to  the  same  kind  of  fish  and  they  die  after 
a  certain  period  of  time.  Sharks  eat  fish,  which  also  results  in  the  death  of  fish.  With 
the  above  actions  of  sharks  and  fish,  the  problem  is  to  determine  the  distribution  of 
sharks  and  fish  after  a  certain  time.  The  initial  distribution  is  given  as  a  parameter 
of  the  problem.  Figure  2.1  shows  an  example  of  the  distribution  of  sharks  and  fish. 
Fish  axe  represented  by  small  dots  and  sharks  by  larger  ones. 

WaTor  provides  a  good  paradigm  for  computations  in  multiprocessor  systems 
where  dynamic  load  balancing  is  useful.  The  workload  here  involves  computations 
required  to  simulate  various  actions  such  eis  moving,  eating,  breeding,  and  dying.  The 
amount  of  work  is  approximately  the  same  for  each  shark  and  fish.  For  multiprocessor 
systems,  we  can  decompose  the  domain  into  the  same  number  of  regions  as  the  number 
of  processors,  so  that  each  processor  is  responsible  for  approximately  an  equal  number 
of  sharks  and  fish.  But,  no  matter  how  we  decompose  and  assign  the  domain,  the 
workload  changes  in  each  processor  when  fish  and  sharks  move,  breed  and  die.  These 
activities  cause  dynamically  changing  and  fluctuating  workload  in  multiprocessor 
systems. 

2.2.2    Binarv  Decomposition 

Berger  and  Bokhari  propose  binary  decomposition  [7]  as  a  load  balancing  strategy 
to  partition  a  computational  domain  in  which  computational  work  is  non-uniformly 
distributed,  into  subdomains  requiring  equal  computational  effort.  The  idea  behind 
this  partitioning  method  is  as  follows.   Assuming  that  work  estimates  on  a  given 
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Figure  2.1.  An  Example  of  the  Distribution  of  Sharks  cind  Fish 
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domain  have  already  been  obtained,  we  make  a  vertical  cut  through  the  domain,  so 
that  the  left  and  right  segments  each  contain  roughly  half  the  work.  Next,  we  make 
two  horizontal  cuts  on  the  regions  obtained  above,  so  that  upper  and  lower  segments 
of  each  contain  half  the  workload.  This  procedure  continues  until  the  number  of 
regions  is  equal  to  the  number  of  processors  assumed  to  be  a  power  of  2.  If  we 
perform  Jfc  partitioning  steps,  the  resulting  domain  will  contain  2*  regions.  Figure  2.2 
shows  the  resulting  partitioning  when  binary  partitioning  is  applied  to  the  WaTor 
example  in  figure  2.1. 

There  are  several  advantages  of  this  partitioning  method.  First,  the  workload 
of  the  resulting  partitioned  domain  will  be  perfectly  balanced  among  the  different 
regions.  Second,  by  restricting  regions  to  rectangular  blocks,  four  corners  of  the 
rectangle  are  enough  to  specify  each  region,  and  the  data  structure  jissociated  with 
it  becomes  simple.  Another  advantage  comes  from  easy  local  rebalancing  when  one 
region  gets  more  work.  If  the  workload  of  a  region  increases  by  a  small  amount,  by 
adjusting  the  last  cut  line  in  the  partition,  imbalance  can  be  reduced  by  half.  If 
the  jb  previous  cut  lines  are  adjusted,  the  imbalance  is  reduced  by  2*.  To  analyze 
the  communication  costs,  Berger  and  Bokhari  construct  a  task  interaction  graph 
by  having  rectangular  regions  as  the  nodes  and  having  edges  between  two  nodes 
whenever  the  corresponding  regions  share  a  boundary.  We  can  also  associate  a  weight 
with  an  edge  of  the  task  graph  that  indicates  the  amount  of  communication  that  needs 
to  take  place  between  the  corresponding  regions.  The  resulting  task  interaction  graph, 
though  planar,  is  not  necessarily  a  mesh  graph.  Hence  communication  is  not  confined 
only  to  neighboring  processors  when  it  is  embedded  into  a  mesh  circhitecture.  The 
upper  bounds  on  the  number  of  edges  and  maximum  degree  of  a  node  in  this  tcisk 
interaction  graph  were  ajialyzed  in  [7]. 
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Figure  2.2.  Partitioning  of  Figure  2.1  using  Binary  Decomposition 
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Even  though  binary  decomposition  is  a  good  method  to  balance  the  workload 
accurately  among  the  processors,  the  cost  of  determining  a  cut  line  is  expensive 
when  computational  work  is  non-uniformly  distributed.  In  figure  2.2,  we  have  to 
draw  a  cut  line  so  that  the  numbers  of  fishes  on  both  side  are  equal.  This  requires  the 
gradual  movement  of  cut  line  before  its  final  position  is  found.  Berger  and  Bokhari  do 
not  mention  how  the  binary  decomposition  method  can  be  implemented  in  parallel. 
For  this  reason,  dynamically  applying  this  method  in  the  course  of  computation  is 
prohibitively  expensive  especially  when  it  is  done  by  a  single  processor. 

2.2.3    Scatter  Decomposition 

Scatter  decomposition  is  a  good  mapping  strategy  to  balance  the  computational 
workload  among  the  processors.  It  divides  the  computational  domain  into  a  set  of 
rectangular  regions  of  the  same  size.  The  regions  are  labeled  with  integers  from  0 
to  n.  Then,  we  can  map  region  i  to  processor  i  mod  N,  where  N  is  the  number  of 
processors.  Figure  2.3  shows  the  result  after  we  apply  scatter  decomposition  to  the 
WaTor  example  of  figure  2.1. 

It  was  observed  that  scatter  decomposition  works  well  because  irregular  work- 
loads were  not  statistically  independent;  high  workload  tends  to  appear  in  contigu- 
ous regions  [28].  A  sufficiently  fine-grained  decomposition  will  split  the  region  up, 
and  modular  assignment  will  spread  its  workload  around.  The  workload  among  the 
processors  remains  balanced  throughout  the  course  of  computation,  which  implies 
that  no  dynamic  remapping  is  necessary.  The  finer  the  decomposition  is,  the  better 
balanced  workload  we  caji  achieve.  But  the  amount  of  communication  necessary  in- 
creases in  proportion  to  the  number  of  regions  in  the  set.  A  compromise  must  be 
made  according  to  the  ratio  between  the  computation  speed  and  the  communication 
speed  of  the  target  machine  as  well  as  the  frequency  of  communication  that  needs  to 
occur  between  the  regions.  The  nature  of  the  job,  CPU  bound  or  I/O  bound,  will 
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Figure  2.3.  Partitioning  and  Assignment  of  Figure  2.1  using  Scatter  Decomposition 
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also  affect  the  decision.  The  communication  time  in  message-passing  architectures  is 
usually  divided  into  communication  setup  time  Tsetup  and  data  transfer  time  Tc,  and 
if  Tc  is  rather  small  compared  to  T^ctup,  then  fine-grained  decomposition  is  preferable 
in  terms  of  both  load  balancing  and  communication,  because  once  communication  is 
setup,  additional  data  can  be  transferred  at  little  cost.  Again,  it  should  be  tuned  to 
the  target  machine  to  achieve  good  performance. 

Scatter  decomposition  was  used  as  a  load  balancing  method  in  [32].  A  formal 
analysis  using  a  probabilistic  model  of  workload  was  given  by  Nicol  and  Saltz  [28]. 
The  difficulty  in  using  this  method  is  determining  the  size  of  the  region  which  should 
be  small  enough  to  balance  the  workload,  and  yet  large  enough  to  avoid  excessive 
communication  cost. 

2.2.4    Diffusion  Scheme 

Diffusion  scheme  is  an  example  of  distributed  load  balancing  strategies  [11,  20]. 
It  hcis  an  analogy  with  the  physical  process  of  diffusion.  In  this  method,  the  amount 
of  workload  to  be  assigned  to  a  processor  at  time  <  +  1  is  determined  by  the  work- 
load difference  between  the  processor  and  its  neighboring  processors  at  time  t.  The 
formulation  of  this  scheme  given  in  [20]  is  as  follows. 

wl    '  =  w]'  +  2^  aij{wj  '  -  w]  ')  +  r]i  -C 
i 

where  w\^^  is  the  number  of  tasks  to  be  performed  by  processor  i  at  time  t;  each  task 
hcis  the  same  amount  of  computational  load.  The  summation  is  over  all  processors  j  in 
the  network,  and  a,j  =  1  or  0  depending  on  whether  i  and  j  are  connected  by  a  direct 
link  in  the  network.  The  term  tj^*'^^^  accounts  for  new  additional  work  generated  at 
time  t  for  processor  i.  The  term  C  is  a  constant  amount  of  work  that  any  processor 
can  accomplish  between  time  t  and  time  t  +  1.   Based  on  this  information,  each 
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processor  performs  only  local  load  balancing  operations  by  moving  some  workload  to 
or  from  its  neighboring  processors. 

This  scheme  was  used  for  load  balancing  when  molecular  dynamics  simulation 
algorithms  are  run  on  multiprocessor  systems  [11].  The  time  to  balance  the  new 
workload  is  shown  to  be  O(n^)  (where  n  is  the  number  of  processors).  This  time 
complexity  becomes  unacceptable  if  the  workload  changes  very  frequently  in  some 
processors.  But,  if  the  workload  distribution  changes  very  slowly,  there  is  suffi- 
cient time  to  smooth  the  workload  distribution  through  a  chain  of  local  balancing 
operations.  This  simple  approach  may  be  useful  in  particle  dynamics  simulation  ap- 
plications where  the  workload  distribution  changes  slowly  due  to  particle  migration. 
Cybenko  [20]  showed  the  conditions  under  which  this  scheme  converged  and  the  rate 
of  convergence  for  arbitrary  network  topologies. 

2.2.5    Other  Approaches 

Sadayappan  and  Ercal  introduced  a  strategy  for  mapping  finite  element  graphs 
onto  processor  meshes  [30].  The  finite  element  graphs  they  consider  axe  connected 
planar  graphs  that  consist  of  rectilinear  four  node  elements.  It  can  be  a  simple  reg- 
ular mesh,  irregular  mesh  or  nonmesh  graph.  The  mapping  strategy  considers  both 
load  balancing  and  communication  aspects.  The  graph  is  initially  partitioned  and 
assigned  to  the  processors  in  such  a  way  that  the  nodes  of  an  element  are  assigned 
to  the  same  processor  or  neighboring  processors.  The  amount  of  workload  assigned 
to  each  processor  can  differ  at  this  stage.  A  subsequent  procedure  tries  to  balance 
the  workload  among  the  processors  by  a  simple  boundary  refinement  technique.  Per- 
fect load  balancing  is  not  always  guaranteed  because  the  neaxest  neighbor  mapping 
property  is  enforced  in  the  latter  procedure. 

Simon  [29]  discusses  three  graph-theoretic  techniques  to  balance  the  computa- 
tions that  arise  in  unstructured  grid  calculations  based  on  finite  volume  methods. 
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Typical  applications  here  include  computational  fluid  dynamics  or  structural  analy- 
sis based  on  finite  element  methods.  The  target  machine  here  is  a  MIMD  distributed 
memory  architecture.  All  the  three  techniques  recursively  apply  a  procedure  that 
partitions  a  planar  graph  into  two  halves  with  roughly  equal  number  of  nodes  with 
the  goal  of  minimizing  the  number  of  edges  in  the  cutset.  The  first  method,  called 
"recursive  coordinate  bisection"  assumes  that  the  graph  is  embedded  in  a  coordinate 
(Euclidean)  system  and  partitions  vertices  into  two  halves  along  the  coordinate  di- 
rection of  "maximum  spread" .  The  second  method  called  "recursive  graph  bisection" 
considers  shortest  path  length  (number  of  edges)  from  an  "extremal"  vertex  in  the 
graph  as  distance  measure  and  assigns  vertices  with  smaller  distances  in  one  half;  an 
extremal  vertex  is  located  from  one  of  the  vertices  at  a  distance  equal  to  diameter  of 
the  graph.  The  third  method,  called  "recursive  spectral  bisection"  uses  the  direction 
of  eigen  vector  associated  with  the  second  largest  eigenvalue  of  the  Laplacian  matrix, 
namely  A  —  D  where  A  is  the  adjacency  matrix  of  the  graph  Jind  D  is  the  diagonal 
matrix  of  vector  degrees;  magnitude  of  this  eigen  value  (which  is  negative)  is  a  mea- 
sure of  connectivity  of  the  graph.  The  graph  is  partitioned  using  components  of  the 
eigen  vector  as  vertex  weights.  Based  on  an  example  grid  with  about  6000  vertices, 
the  author  concludes  that  recursive  spectral  bisection  gives  smallest  number  of  edges 
in  the  cutset. 

Karen  Devine  et  al.  [21]  use  a  similar  adaptive  load  balancing  strategy  called 
"tiling"  to  balance  the  work  load  during  the  execution  of  their  parcdlel  adaptive 
finite  element  methods  used  to  solve  hyperbolic  conservation  laws.  Here,  overlapping 
neighborhoods  of  processors  are  defined  where  each  processor  is  at  the  center  of  a 
neighborhood  and  its  adjacent  processors  in  the  architecture  that  share  subdomciin 
boundaries  belong  to  its  neighborhood.  The  idea  here  is  for  each  processor  to  balance 
the  workload  in  its  neighborhood  where  the  workload  is  measured  by  number  of 
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elements  and  per-element  processing  cost;  this  is  done  by  transferring  only  elements 
in  the  subdomain  boundary  to  the  neighboring  processors. 

Williams  used  three  different  approaches  for  load  balancing  on  hypercube  mul- 
tiprocessor systems  [32].  These  approaches  include  scatter  decomposition,  nearest 
neighbor  balancing  and  simulated  annealing.  Nearest  neighbor  balancing  is  very  sim- 
ilar to  diffusion  scheme.  Simulated  annealing  is  a  random  search  technique  that  is 
analogous  to  the  physical  phenomenon  of  annealing.  The  "energy"  associated  with  a 
particular  state  (workload  distribution)  of  the  system  is  measured  by  the  workload 
imbalance  among  the  processors.  A  good  mapping  corresponds  to  a  low  energy  state 
of  the  system  and  is  found  by  the  simulated  annealing  process  in  which  the  tempera- 
ture of  the  system  is  slowly  reduced.  The  temperature  of  the  system  is  a  measure  of 
the  sum  of  computational  and  communicational  loads.  The  process  of  simulated  an- 
nealing is  very  time  consuming  because  of  the  slow  cooling  schedule  that  is  typically 
used.  Thus,  it  is  not  practical  to  use  this  technique  very  often  during  the  course  of 
computation. 

Crutchfield  [19]  introduces  a  load  balancing  technique  and  applies  it  to  adaptive 
mesh  refinement.  The  load  balancing  problem  is  modeled  as  follows.  We  aie  given  K 
boxes  and  N  balls,  where  the  boxes  correspond  to  the  processors  and  balls  correspond 
to  jobs.  The  balls  have  random  weights  based  on  some  distribution.  The  problem  is 
to  assign  the  balls  to  the  boxes  so  that  the  boxes  are  nearly  equal  in  weight.  Let 
be  the  weight  of  the  i-th  ball  in  box  a.  The  load  balancing  efficiency  is  defined  as 

EfficiencyLB  =  ^  ^° 

A  maXa(E,- ) 

In  this  approach,  the  basic  unit  of  parallelization  is  a  grid.  The  fine  grids  created 
in  the  course  of  computation  correspond  to  balls  in  the  above  model.  The  workload 
that  corresponds  to  integration  time  is  estimated  by  a  weighted  sum  of  the  area  and 
perimeter  of  the  grid.  The  integration  time  for  coarse  grid  is  ignored  because  it  is 
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assumed  to  be  small  compared  to  that  of  fine  grids.  Communication  costs  are  also 
ignored.  A  simple  heuristic  algorithm  is  proposed  to  find  an  approximate  solution 
to  the  above  problem.  Experimental  results  show  that  the  algorithm  produces  good 
solutions  when  the  ratio  N/K  is  greater  than  three.  The  ratio  is  an  important 
factor  that  controls  the  uniformity  of  the  load  balance.  Some  performance  results  are 
presented  from  the  experiments  on  Cray  X/MP  and  BBN  TC2000.  They  show  that 
when  the  ratio  NJK  is  high  enough  (  >  3  ),  efficiency  is  close  to  1. 

Hanxleden  and  Scott  [25]  developed  a  testbed  for  different  balancing  techniques 
including  scatter  decomposition,  binary  decomposition  and  some  dynamic  balancing 
strategies.  The  example  problem  used  in  their  experiments  is  WaTor  that  we  dis- 
cussed before.  They  introduce  a  simple  decentralized  balancing  strategy  in  which 
each  node  computes  the  amount  of  its  own  workload,  and  broadcasts  it  to  other 
nodes  as  load  distribution  information;  since  this  is  just  a  single  number,  there  is 
little  communication  overhead.  Each  node  however  does  not  know  the  internal  work- 
load distribution  of  other  nodes  and  hence,  exact  load  balancing  is  not  possible. 
New  mapping  is  determined  assuming  the  workload  is  evenly  distributed  over  the 
entire  domain.  The  above  approaches  were  implemented  on  Intel  iPSC/2  distributed 
memory  parallel  computer  and  their  effects  on  the  message  size  and  the  number  of 
messages  were  observed.  The  granularity  of  mapping  and  the  number  of  processors 
were  Vciried  in  their  experiments.  It  was  observed  that  fast  load  balancing  techniques 
may  yield  better  overall  performance  than  slower  ones  that  provide  a  better  balance. 

SAR  (Stop  At  Rise)  policy  was  proposed  by  Nicol  and  Saltz  [27]  to  determine 
when  to  perform  remapping  of  computations  onto  the  processors.  They  assume  that 
global  remapping  is  done,  the  cost  of  remapping  is  known  beforehand  and  run  time 
performance  can  be  easily  measured.  The  objective  of  the  policy  is  to  minimize  W{n) 
(for  n  processors),  where  W{n)  is  a  measure  of  average  processor  idle  time.  Though 
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they  demonstrate  through  analytical  study  and  experimental  results  that  their  policy 
is  optimal,  the  assumptions  made  may  be  unrealistic  in  some  applications.  Estimat- 
ing W{n)  at  each  iteration  and  global  remapping  of  workload  may  be  prohibitively 
expensive  on  some  architectures. 

2.3    Concluding  Remarks 

In  this  chapter,  we  surveyed  a  few  adaptive  mesh  refinement  techniques  and  many 
different  approaches  for  load  balancing  in  multiprocessor  systems.  Some  of  these 
approaches  are  shown  to  be  effective  for  particular  applications.  But  none  of  them 
appears  to  be  suitable  for  adaptive  mesh  refinement.  Binary  decomposition  is  a 
good  static  load  balancing  technique.  But  it  is  very  costly  to  apply  many  times 
during  the  execution  of  an  algorithm.  Even  though  local  adjustment  of  workload  is 
possible,  it  is  effective  only  when  workload  change  is  confined  to  a  small  region  in  the 
computational  domain.  Scatter  decomposition  provides  good  load  balancing  even 
for  a  dynamically  changing  workload.  But  inter-processor  communication  is  very 
high  in  this  approach.  Diffusion  scheme  is  suitable  for  applications  such  as  particle 
dynamics  simulation  applications  where  the  workload  distribution  changes  slowly  due 
to  particle  migration.  It  is,  however,  not  suitable  for  adaptive  mesh  refinement  based 
numerical  algorithms  where  the  workload  can  change  abruptly  in  a  region. 

Simulated  annealing  is  a  good  random  search  technique  to  escape  from  local  ex- 
trema  and  to  seek  close  to  optimal  solutions.  But  it  is  very  time  consuming  process 
because  of  the  slow  cooling  schedule.  Thus,  it  is  not  a  practical  dynamic  load  bal- 
ancing technique.  In  Crutchfield's  approach  [19],  each  grid  is  treated  as  a  single 
indivisible  task  and  the  load  balancing  problem  becomes  a  bin-packing  problem.  But 
communication  cost  was  completely  ignored  in  this  approach.  In  addition  to  work- 
load balance,  communication  cost  is  a  very  important  factor  in  mapping  of  adaptive 
mesh  computations  onto  multiprocessor  systems.   For  this  reason,  our  goal  in  this 


27 


dissertation  is  to  design  dynamic  load  balancing  algorithms  which  explicitly  consider 
the  trade-off  between  communication  and  computational  costs. 


CHAPTER  3 
GRID  INDIVISIBILITY  APPROACH 

In  this  chapter,  we  discuss  an  approach  for  load  balancing  in  adaptive  mesh 
refinement  algorithms  based  on  the  assumption  that  each  grid  is  a  separate  indivisible 
computational  entity.  First  we  describe  the  characteristics  of  non-time  dependent 
partial  differential  equations  and  explain  why  dynamic  load  balancing  is  necessary 
for  parallel  solution  of  such  problems. 

3.1    An  Adaptive  Mesh  Refinement  Algorithm 

In  non-time  dependent  problems,  it  is  not  necessary  to  move  and  delete  grids  in  the 
course  of  solving  the  equations.  But  a  dynamic  procedure  is  still  necessary  because 
the  regions  of  large  solution  variations  are  not  known  prior  to  the  solutions.  An 
example  of  this  kind  of  partial  differential  equations  is  Laplace's  equation  describing 
the  temperature  distribution  in  an  equilibrium  state. 

u(x,  0)  =  5i(x)  u{L,  y)  =  giiy) 
u{x,H)=g3{x)       u{0,y)  =  g4{y) 

The  above  Laplace's  equation  describes  the  temperature  distribution  in  a  rectangu- 
lar plate  of  width  L  and  height  H.  The  regions  requiring  mesh  refinement  are  found 
by  estimating  errors  in  the  current  solutions  to  the  equations.  When  we  solve  such 
problems,  the  amount  of  computational  workload  for  each  grid  is  approximately  pro- 
portional to  the  number  of  grid  points.  In  addition  to  the  computational  workload, 
the  grids  at  different  levels  need  to  communicate  for  projecting  the  fine  grid  solu- 
tions and  interpolating  boundary  values  for  the  next  level  grids.  When  solutions  are 
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evaluated  at  the  interior  points  of  a  grid,  the  grid  points  in  the  same  grid  also  need 
to  communicate  with  each  other  to  exchange  values.  The  steps  common  to  most 
adaptive  mesh  refinement  methods  include  the  following. 


1.  i  =  0 

2.  Impose  a  coarse  grid  at  level  0  on  the  entire  domain 

3.  while  (true) 

begin 

4.  Obtain  the  solution  to  the  equation  at  mesh  points  of  the 
grids  at  level  i. 

5.  if      0  then 

Project  solutions  of  level  i  grids  onto  grids  at  level  i  —  1. 

6.  Estimate  error  at  each  grid  point  at  level  i  grid  and  flag  the  points 
with  high  errors 

7.  if  there  is  no  flagged  grid  point  then  exit. 

8.  Cluster  the  flagged  grid  points. 

9.  Create  finer  grids  at  level  i  +  1  on  the  flagged  regions. 

10.  Interpolate  boundary  values  for  grids  at  level  i  +  1. 

11.  1  =  1-1-1 
end 


In  line  4,  the  solutions  are  obtained  by  applying  various  finite  difference  schemes. 
The  resulting  systems  of  equations  can  be  solved  using  iteration  methods  of  Jacobian 
or  Gauss-Seidel  type.  In  line  8,  the  grid  points  with  high  estimated  error  are  flagged 
and  clustered.  Some  simple  clustering  algorithms  can  be  found  in  [8,  9].  The  fine 
grids  generated  in  line  9  can  be  rectangular  [8]  or  non- rectangular  [1].  The  above 
while  loop  is  iterated  until  the  desired  level  of  accuracy  is  obtained. 
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In  this  chapter  and  the  next  two  chapters,  we  present  three  different  approaches 
for  mapping  the  computations  of  dynamically  generated  grids  onto  multiprocessor 
systems.  Each  approach  differs  in  granularity  of  computation  and  type  of  communi- 
cation, and  hence  requires  a  different  load  balancing  strategy.  In  these  approaches, 
we  limit  intra-grid  communication  to  neighboring  processors,  but  allow  inter-grid 
communication  between  processors  farther  apart.  We  assume  that  the  underlying  in- 
terconnection network  used  in  the  parallel  architecture  is  a  mesh  or  a  hypercube,  but 
the  first  approach  can  be  easily  extended  to  other  parallel  architectures  with  regular 
interconnection  networks. 

3.2    Problem  Formulation 

In  our  first  approach,  we  consider  each  grid  as  a  separate  indivisible  computational 
entity.  Each  time  a  fine  grid  is  generated,  the  entire  grid  is  assigned  to  a  processor 
according  to  its  computational  load.  Hence  when  solutions  are  evaluated  at  interior 
points  of  grids  (based  on  a  5-point  or  9-point  stencil),  all  necessary  grid  points  are 
available  within  the  processor,  avoiding  the  need  for  intra-grid  communication  cost. 
But  inter-grid  communication  is  still  necessary  whenever  a  fine  grid  is  generated 
within  a  coarse  grid  and  assigned  to  a  processor  different  from  the  processor  the 
coarse  grid  is  assigned  to.  Since  intra-grid  communication  is  performed  more  often 
than  inter-grid  communication,  this  approach  eliminates  the  necessity  for  frequent 
communication  among  the  processors.  But  the  degree  of  parallelism  is  strictly  limited 
by  the  number  of  grids  at  each  level.  At  the  beginning  of  adaptive  mesh  algorithm, 
there  is  only  one  grid  assigned  to  a  processor  and  until  the  number  of  grids  rea<;hes 
the  number  of  processors,  some  processors  have  to  be  idle. 

When  we  formulate  the  load  balancing  problem,  we  use  the  following  assumptions 
in  all  our  approaches  : 
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1.  A  grid  at  level  i  {i  >  0)  is  completely  contained  inside  a  grid  at  level  i  —  I. 
This  is  a  most  commonly  used  assumption  in  adaptive  mesh  refinement  for 
simplification  of  numerical  algorithms.  In  this  case,  we  call  the  two  grids  "child" 
and  "parent"  grids  respectively. 

2.  Grid  computations  are  synchronized  such  that  grid  computations  at  level  i  start 
at  the  same  time.  Dynamic  load  balancing  is  performed  whenever  finer  grids 
are  created  and  before  their  computations  are  started. 

3.  Computational  cost  for  each  grid  depends  only  on  the  number  of  mesh  points 
it  contains.  This  cost  includes  time  for  steps  4,  6,  8,  9  and  10  in  the  previous 
section.  For  simplicity,  we  will  assume  that  the  cost  is  a  linear  function  of  the 
number  of  points,  though  our  formulation  can  be  used  for  any  known  function 
of  number  of  points. 

4.  There  are  two  types  of  interprocessor  communication  called  "intra-grid  com- 
munication" and  "inter-grid  communication"  respectively.  Intra-grid  commu- 
nication is  needed  whenever  computation  of  a  value  at  a  mesh  point  (during 
step  4  in  the  previous  section)  in  a  processor  needs  a  value  at  a  neighboring 
mesh  point  located  in  another  processor;  both  the  mesh  points  belong  to  the 
same  grid  at  some  level.  Inter-grid  communication  occurs  during  steps  5  and 
10  whenever  a  child  grid  is  located  in  a  different  processor  than  that  of  its  par- 
ent. Communication  cost  is  directly  proportional  to  the  distance  between  the 
processors  in  the  network  and  we  ignore  traffic  on  the  communication  links. 

From  the  first  assumption,  it  follows  that  the  relation  among  the  grids  at  different 
levels  can  be  described  by  a  tree.  Each  node  in  the  tree  represents  a  grid.  A  node 
at  level  i  in  the  tree  corresponds  to  a  grid  at  level  i.  The  edges  in  the  tree  denote 
communication  necessary  between  a  parent  and  its  children.  Figure  3.1  shows  an 
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example  of  the  grid  structure  and  the  corresponding  tree.  In  the  figure,  "A"  is  a 
coarse  grid  at  level  0  and  it  contains  "B",  "C",  "D"  which  are  the  fine  grids  at  level 
1.  "E",  "F",  "G"  are  the  fine  grids  at  level  2  and  they  are  contained  in  their  parent 
grids  at  level  1. 

The  nodes  in  the  tree  are  labeled  with  positive  numbers  representing  the  compu- 
tational workload  and  the  edges  are  labeled  with  positive  numbers  representing  the 
amount  of  communication  between  the  nodes.  The  tree  grows  dynamically  as  new 
grids  are  generated. 

Now  the  dynamic  load  balancing  problem  here  is  one  of  assigning  the  fine  grids 
created  at  some  level  to  the  processors  in  the  system  such  that  the  sum  of  computation 
and  interprocessor  communication  costs  is  minimized.  To  formulate  the  execution 
time  including  both  computational  and  communication  costs  of  level  i,  we  introduce 
the  following  notations  : 

G  =  {V,E)  —  processor  graph  where  V  is  the  set  of  processor  nodes  and 

E  the  set  of  communication  links 
S  —  set  of  fine  grids  generated  at  level  i 

f  :  S  —y  V  —  /(a)  is  the  processor  where  grid  a  was  generated 
Ma  —  number  of  mesh  points  in  grid  a 

Ba  —  number  of  mesh  points  on  the  boundary  of  grid  a 

Ucomp  —  computational  cost  per  mesh  point;  depends  on  the  finite  difference 

method  used 

Ucomm         —  cost  of  transmitting  a  mesh  point  value  between  neighboring 
processors 

d{PiiP2)      —  communication  distance  between  processors  pi  and  p2  in  G 
g  :  S  —^V   —  grid  assignment  function  to  be  determined  by  load  balancing 
h  :V     2^  —  inverse  of  g;  h{p)  is  the  set  of  grids  assigned  to  processor  p 
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Figure  3.1.  An  Example  of  Grid  Structure  and  the  Corresponding  Tree 
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Note  that  only  one  processor  is  responsible  for  performing  the  computations  associated 
with  a  single  grid.  Hence  inter-processor  communication  occurs  only  due  to  inter- 
grid  communication.  For  a  grid  a,  if  /(a)  ^  g{a),  then  communication  cost  equal  to 
Ucommd{f{a),g{a))Ba  is  incurred  in  sending  the  boundary  values  for  grid  a  to  proces- 
sor g{a);  after  computing  values  for  grid  a,  a  cost  equal  to  Ucommd{g(a),  f(a))Ma  is 
incurred  in  sending  these  values  back  to  its  parent  grid  for  updating.  Then  the  total 
execution  time  of  level  i  is  as  follows. 

Ti  =  Ucomp  max  f  E  ^«  I  +  ^comm  max  {d{f{a),  g{a)){Ba  +  M„)) 

Uhp)  ) 

The  first  term  represents  the  computation  time  to  solve  the  problem  on  the  grids 
assigned  to  processor  p.  It  also  includes  the  costs  of  error  estimation,  clustering  and 
generating  the  next  level  grids.  The  second  term  represents  the  inter-grid  communi- 
cation cost.  Thus  our  load  balancing  problem  using  the  grid  indivisibility  approach 
can  be  stated  as  follows  :  Given  G,  5,  /,  determine  g  (and  hence  h)  such  that 
Ti  is  minimized. 

This  problem  can  be  considered  as  a  general  process  assignment  problem  in  mul- 
tiprocessor systems  if  we  treat  grids  as  processes.  We  assume  that  processes  at  level 
i  generate  child  processes  at  level  i  \.  The  parent  processes  wait  until  they  re- 
ceive results  from  their  child  processes,  then  terminate.  When  the  child  processes 
are  executed  on  processors  different  from  those  on  which  parent  processes  are  exe- 
cuted, there  is  communication  cost  for  migrating  the  child  processes  and  receiving 
messages  from  them.  If  we  assume  that  the  execution  time  of  each  process  and  the 
size  of  message  between  parent  and  child  are  known,  then  our  problem  becomes  one 
of  distributing  the  processes  among  the  processors  such  that  the  sum  of  computation 
and  interprocessor  communication  costs  is  minimized. 

This  type  of  computation  can  also  be  found  in  algorithms  that  employ  search 
trees  to  explore  a  solution  space  (examples  include  branch-and-bound,  best-first. 
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breadth-first,  depth-first  etc.).  Branch-and-bound  algorithm  is  a  search  method  that 
is  useful  for  solving  optimization  problems.  It  generates  state  space  tree  similar  to 
Figure  3.1.  Since  bounding  functions  are  used  to  prune  nodes  in  this  tree,  the  tree 
grows  dynamically  at  various  levels  as  the  searching  proceeds. 

Our  load  balancing  problem  can  easily  be  shown  to  be  intractable.  If  we  ignore 
communication  cost  and  assume  that  there  are  only  two  processors,  the  2-partition 
problem  can  be  reduced  to  this  problem.  Since  the  2-partition  problem  was  aJready 
shown  to  be  NP-complete  [24],  this  problem  is  NP-hard.  Finding  an  optimal  solu- 
tion to  this  problem  will  be  computationally  intractable.  Therefore,  we  seek  efficient 
algorithms  which  produce  near  optimal  solutions.  Moreover,  we  are  interested  in 
heuristic  load  balancing  algorithms  that  can  be  efficiently  implemented  on  parallel 
architectures  with  regulcir  interconnection  topologies  such  as  meshes  and  hypercubes. 
In  the  next  section,  we  propose  a  simple  heuristic  for  seeking  an  efficient  approximate 
solution  to  this  problem. 

3.3    Proposed  Solutions 

In  the  formulation  of  our  load  balancing  problem,  we  try  to  minimize  the  sum 
of  two  costs,  computation  and  communication.  The  relative  importance  of  one  over 
another  depends  on  the  nature  of  the  problem  we  are  solving  and  the  ratio  of  compu- 
tational to  communicational  cycle  times  of  the  machine  on  which  the  problem  will  be 
solved.  If  the  frequency  of  communication  ("granularity  of  computation")  is  very  low, 
balancing  the  workload  is  much  more  important  than  reducing  the  communication 
cost.  On  the  other  hand,  low  granularity  of  computation  makes  reducing  communi- 
cation cost  much  more  important  than  balancing  the  workload  axnong  the  processors. 
In  the  extreme  Ccise,  parallelizing  an  algorithm  does  not  provide  the  desired  benefit 
when  the  communication  between  processes  running  on  different  processors  dom- 
inates the  computational  load.  In  the  applications  we  consider  namely  solution  of 
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partial  differential  equations,  however,  the  massive  amount  of  computation  performed 
relative  to  communication  sufficiently  justifies  seeking  parallel  solutions. 

In  the  approach  we  propose,  we  treat  inter-grid  communication  cost  as  a  constraint 
and  try  to  find  an  assignment  of  grids  which  minimizes  the  maximum  computational 
cost  while  satisfying  the  constraint.  With  this  strategy,  the  formulation  of  our  load 
balancing  problem  is  modified  as  follows  : 

Find  /  (and  hence  g)  such  that  maXpgv  HaeHp)  ^comp^<^  minimized  with 
the  constraint  that  T[i&XaesUcommd{f{a),g{a)){Ba  +  Ma)  <  Cmax  where  Cmax  is 
the  acceptable  limit  for  inter-grid  communication  cost.  The  parameter  Cmax 
provides  the  desired  trade-off  between  computational  and  communication  costs  and 
can  be  set  based  on  the  nature  of  the  application  and  the  architectural  parameters 
such  as  ratio  of  communication  bandwidth  to  computational  capacity  in  the  system. 
Allowing  Cmax  to  be  large  will  result  in  a  well  balanced  workload  at  the  expense  of 
high  communication  cost,  while  a  small  Cmax  may  result  in  a  badly  balanced  workload 
but  low  communication  cost.  In  our  heuristic  algorithm,  Cmax  will  be  used  to  limit 
the  maximum  distance  that  each  grid  can  migrate  from  the  processor  where  it  was 
generated. 

3.3.1    Proposed  Heuristic 

Our  load  balancing  problem  is  identical  to  the  multiprocessor  scheduling  problem 
[16]  if  the  communication  constraint  is  ignored.  Given  N  processors  and  a  finite  set 
T  =  {<i,^2,  •  •  •  ,im}  of  tasks  with  execution  time  of  each  task  e(<,)  £  Z"*",  the  multipro- 
cessor scheduling  problem  is  finding  a  partition  of  T  into  disjoint  sets  Ti,T2, . . .  ,Tn 
such  that  max,  J3tj€r,^(^j)-  There  are  many  well-known  heuristics  such  as  first-fit, 
decreasing-fit  and  multi-fit  [17]  which  provide  approximate  solutions  that  are  within 
constant  factor  (less  than  twice)  of  the  optimal  solution.  These  heuristics  are  however 
sequential  in  nature  and  also  they  ignore  the  communication  constraint. 
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In  the  heuristic  we  propose,  we  first  assign  a  label  ta  to  each  grid  a  that  indicates 
the  maximum  distance  that  it  can  migrate  from  its  currently  assigned  processor  dur- 
ing the  load  balancing  process.  Initially,  =  [ucom^rB'+Ma)\ '  inversely 
proportional  to  the  size  of  the  grid.  Each  time  a  grid  needs  to  move  by  distance  d 
during  a  load  balancing  step,  ta  decreases  by  d  and  when  ta  becomes  zero,  it  remains 
assigned  to  its  current  location. 

For  balancing  the  load,  we  use  a  recursive  procedure  similar  to  binary  decomposi- 
tion, that  is,  first  balancing  load  between  two  halves  of  processors  and  then  applying 
the  procedure  recursively  to  the  two  halves.  Figure  3.2  shows  the  first  four  levels  of 
this  recursive  procedure  for  mesh  multiprocessors.  In  order  to  balance  the  workload 
between  two  halves,  we  order  the  grids  in  each  half  in  non-increasing  order  of  ta 
values  (only  grids  with  strictly  positive  ta  values  are  considered)  and  for  the  same  ta 
values  in  increasing  order  of  Ma  values  (which  are  measures  of  computational  cost). 
We  move  grids  from  overloaded  half  to  underloaded  half  in  the  above  order.  The 
intuition  behind  this  order  is  that  if  grids  with  larger  ta  values  are  allowed  to  migrate 
first,  they  can  also  migrate  during  the  later  iterations  of  the  load  balancing  algorithm 
allowing  a  better  chance  to  balance  the  workload.  Also  from  the  point  of  view  of  grid 
migration  cost,  smaller  sized  grids  are  preferred  over  larger  sized  grids. 

We  consider  two  approaches  which  we  call  "concurrent"  and  "parallel"  approaches 
for  implementation  of  this  heuristic  on  a  multiprocessor  system.  In  both  these  ap- 
proaches, each  processor  first  creates  a  "token"  or  a  "packet"  for  each  grid  that  it 
has.  This  token  contains  information  about  the  grid  such  as  its  size,  presently  as- 
signed location  (processor),  ta  value  etc.  During  the  load  balancing  algorithm,  only 
these  tokens  are  manipulated  and  at  the  end  of  the  algorithm,  each  processor  knows 
where  each  of  its  grids  has  to  be  moved.  The  actual  migration  of  grids  is  not  part 
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Figure  3.2.  The  Pairs  of  Sets  of  Processors 
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of  the  load  balancing  algorithm;  note  that  the  grid  migration  cost  is  included  in  the 
communication  cost  constraint. 

In  the  concurrent  approach,  every  processor  first  broadcasts  all  the  tokens  it  has, 
to  every  other  processor  in  the  system.  During  the  first  iteration,  each  processor  does 
the  same  computation,  namely  attempt  to  balance  the  load  between  two  halves  of 
processors  using  the  proposed  heuristic.  During  the  second  iteration,  all  the  proces- 
sors in  a  half  do  the  same  computation,  namely  attempt  to  balance  the  load  between 
two  quadrants  of  that  half.  Thus  this  process  proceeds  upto  log  N  (assuming  N  to  be 
a  power  of  2)  iterations,  where  during  the  i-th  iteration  (0  <  i  <  log  N  —  1),  a  group 
of  ^7  processors  performs  the  same  tcisk  of  balancing  workload  between  two  halves 
of  the  group.  An  advantage  of  this  method  is  that  every  processor  determines  the 
new  locations  of  its  grids  independently  and  hence  it  can  initiate  the  migration  of  a 
grid  as  soon  as  its  new  location  is  determined.  Major  disadvantages  of  this  method 
include  redundancy  in  computation  and  excessive  message  broadcasting  in  the  first 
step  especially  when  number  of  grids  is  quite  large. 

In  the  parallel  approach,  we  exploit  the  parallelism  available  in  the  architecture 
in  executing  our  load  balancing  algorithm.  We  perform  efficiently  all  the  steps  of 
our  algorithm  in  parallel  but  at  the  end  of  the  algorithm,  each  processor  may  have 
to  be  informed  by  another  processor  regarding  the  new  location  of  its  grid.  This 
method  will  be  shown  to  be  clearly  more  advantageous  over  the  first  method  when 
the  number  of  tokens  is  large  and  when  two  messages  destined  for  the  same  processor 
are  allowed  to  be  "combined"  in  a  processor. 

In  the  next  two  sections,  we  give  details  of  the  algorithm  for  the  two  approaches 
and  analyze  their  time  complexities.  In  describing  our  algorithm,  we  use  a  mul- 
tiprocessor architecture  with  a  hypercube  topology  though  the  algorithm  is  easily 
applicable  to  other  regular  topologies  such  as  meshes. 
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3.4    Concurrent  Approach 
The  following  steps  are  performed  in  the  load  balancing  algorithm: 

(a)  Each  processor  creates  for  each  grid  a  that  it  has,  a  token  (or  a  packet)  con- 

taining the  following  information  :  (i)  Ma,  size  of  the  grid,  (ii)  ta,  maximum 
distance  that  it  can  migrate  from  its  presently  assigned  location  (for  a  hyper- 
cube  network,  it  is  at  most  log  TV)  and  (iii)  pa,  currently  assigned  processor  for 
a  which  is  initially  set  to  /(a)  (processor  where  it  Wcis  generated). 

(b)  Each  processor  broadcasts  these  tokens  to  every  other  processor  in  the  system. 

For  this  step,  we  can  use  either  the  multinode  broadcasting  algorithm  for  hyper- 
cubes  given  in  [10]  repeatedly  or  use  the  multinode  broadcasting  algorithm  for  a 
ring  [10].  The  former  algorithm  is  off-line  meaning  that  each  processor  knows 
in  advance  the  different  communication  steps  of  the  algorithm;  also  it  cannot 
be  used  in  a  pipelined  mode.  The  latter  algorithm  is  on-line  and  it  can  also  be 
used  in  a  pipelined  fashion  for  broadcasting  of  many  tokens. 

(c)  Each  processor  P»  (0  <  5  <  TV— 1)  then  executes  the  procedure  Load  balance(5) 

concurrently  with  the  above  information. 

Procedure  Load  balance  (s); 

1.  for  A;  =  1  to  logiV  do  steps  2-12. 

2.  Let  A  be  the  set  of  processors  with  the  k  most-significant  bits  in  their 
addresses  same  as  in  5. 

3.  Let  B  be  the  set  of  processors  whose  addresses  are  obtained  by 
complementing  the  A:-th  most  significant  bit  of  those  in  A. 

4.  Let  A  =  \La-  Lb\/2  where  La  =  E{a|p„e>i}      and  Lg  =  E{a|p„€B}  M^. 
These  are  the  total  workloads  associated  with  processors  in  A  and  B 
respectively. 
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5.  If  A  <  e  (where  e  is  a  small  number  denoting  load  imbalance  tolerance  limit) 
then  go  to  step  1  for  the  next  iteration, 

else  increment  A  by  e  (meaning  that  after  load  balancing  new  workload  of  B 

can  exceed  that  of  A  by  c)  and  do  steps  6-12. 

Without  loss  of  generality,  assume  that  La  >  Lb  and 

let  5"  =  {a\pa  €  A,  Ma  <  A  and     >  0}. 

S  is  the  set  of  candidate  grids  for  migration. 

6.  Find  Dmin  =  min{fa|a  G  S}  and  Dmax  =  ma.x{ta\a  €  S}. 

7.  Partition  S  into  5'£)„,„, . . . , where  Sj  =  {a  €  ^l^o  =  j}- 

8.  For  j  =  Dmax  to  Dmin  do  steps  9-12. 

9.  Let  Tj  =  {flj,  oj, . . . ,  a/}  where  a,-  €  Sj,  and  A/a-  <  A  for  all  1  <  z  <  /. 

10.  SelectGrid  (T,-,A,r). 

11.  For  each  grid  a,-  €  Tj,  if  Afa^  <  Ma^  and  Afa,  <  A  then  change  pa,  to 

a  processor  address  obtained  by  complementing  the  A;-th  most  significant 
bit  of      and  update  A  to  A  —  A/o,  - 

Do  this  step  first  for  grids  a,-  with  Mai  <  then  for  grids  a,- 

with  A/a,  =  Afar- 

12.  If  A  <  c  then  exit  loop  of  step  8  and  go  to  step  1. 
end  Load  balance; 

Procedure  SelectGrid(Vr,  A,  r) 

//  Given  a  set  of  grids  W  =  {01,02,. . .  ,o/},  and  a  positive  integer  A,  returns  an 
index  r  such  that  the  following  is  true  : 

53  A/,   <   A  and 

Mi   >   A,  where 
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Wr    =    {1  <  i  <  l\Mai  <  Ma,}  &nd 
Wl    =   {l<i</|M„,  <Ma,} 

Procedure  SelectGrid  is  just  a  weighted  selection  problem  and  can  be  solved  by  a 
divide-and-conquer  algorithm  just  as  in  the  selection  problem  [2].  // 

As  for  the  complexity  of  the  algorithm,  step  (a)  takes  0{Mmax)  time  where  Mmax 
is  the  maximum  number  of  grids  that  a  processor  initially  has.  Step  (b)  takes 
0{Mmax\^j:j)  time  where  M  is  the  total  number  of  grids;  we  assume  that  each  packet 
or  token  takes  unit  time  to  be  transmitted  along  a  communication  link.  It  is  also 
possible  to  use  an  on-line  algorithm  for  step  (b)  by  using  the  multinode  broadcasting 
algorithm  for  a  ring  that  takes  0{M  +  N)  time.  Hence  the  complexity  of  step  (b) 
is  0{mm{M  +  N,  Mmax\^}^))-  Step  (c),  namely  the  procedure  Load  balance  takes 
0{{M  +  log  TV)  log  N)  time  derived  as  follows  :  for  each  iteration  of  step  1,  (i)  steps 
2-7  take  0(M)  time,  and  (ii)  procedure  SelectGrid  takes  0(|VF|)  time  and  hence 
steps  9-12  take  0(M)  time  for  all  the  iterations  of  loop  8.  The  total  time  complexity 
of  the  algorithm  is  0  (min(M  -|-  N,  Mmaxj^)  +  M  log  iV) . 
Mesh  Network  : 

Our  load  balancing  algorithm  can  be  easily  applied  to  mesh  multiprocessors.  We 
assume  for  convenience  that  the  number  of  processors  TV  is  a  power  of  2.  As  in 
Figure  3.2,  the  workload  is  balanced  between  left  and  right  halves  of  processors,  then 
between  upper  and  lower  quadrants  of  processors  and  so  on.  We  need  to  go  through 
log  TV  steps  as  in  hypercube  multiprocessors.  Other  parts  of  the  algorithm  remain 
unchanged.  Since  the  maximum  distance  between  two  processors  is  only  log  TV  in 
a  hypercube  while  it  is  y/N  in  a  mesh,  we  have  better  opportunity  to  balance  the 
workload  with  a  hypercube  topology.  The  time  complexity  of  the  algorithm  for  a 
mesh  topology  is  analyzed  as  follows.  Step  (a)  takes  0{Mmax)  time,  and  step  (b) 
takes  0{M  +  TV)  time  if  we  use  the  on-line  multinode  broadcasting  algorithm  for  a 
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ring.  Step  (c)  takes  0{M  log  N)  time  from  the  same  reasoning  as  above.  The  total 
time  complexity  of  the  algorithm  for  a  mesh  topology  is  0{N  +  MlogN). 

3.4.1  Improvement 

In  step  (b)  of  our  heuristic  algorithm,  all  the  processors  have  to  broadceist  their 
packets  at  the  same  time.  Since  the  communication  links  between  processors  are 
used  to  transmit  a  number  of  messages,  this  will  cause  a  serious  traffic  congestion. 
To  avoid  this  problem,  we  designate  a  processor  (Py/N  ,  ^/n  ,  for  a  mesh  and  Pq -o  for 
a  hypercube  with  N  processors),  and  have  all  the  processors  send  their  packets  to  the 
processor  before  the  first  iteration  of  load  balancing.  This  processor  is  responsible 
for  determining  the  distribution  of  grids  between  the  two  halves  of  processors.  After 
this  is  done,  we  designate  two  other  processors,  one  in  each  half  of  the  processors, 
and  send  the  updated  packets  to  the  two  processors.  Then,  the  same  procedure  is 
applied  recursively  to  the  two  halves  of  processors  during  the  second  iteration  of  load 
balancing.  At  the  last  step  of  the  algorithm,  load  balancing  is  performed  for  each 
pair  of  processors  with  one  of  the  processors  responsible  for  load  balancing.  At  the 
end  of  the  algorithm,  a  processor  that  holds  a  packet  has  to  inform  the  processor  in 
which  the  grid  was  generated  about  its  new  location.  Unlike  the  previous  algorithm, 
the  actual  migration  of  grids  can  start  only  after  the  load  balcincing  algorithm  is 
completed. 

With  the  above  modifications,  we  can  avoid  multinode  broadceist  of  step  (b)  be- 
fore procedure  Load-Balance  starts  execution.  Also  there  is  no  redundancy  in 
computation  as  in  the  previous  algorithm.  Instead,  we  need  a  single-node  gathering 
operation  to  collect  all  the  packets  at  a  designated  processor.  If  we  use  the  same  idea 
as  in  multinode  broadcast,  single-node  gather  takes  0{M  +  N)  time  on  a  mesh  and 
0(min(Af  +  N,  Mmaxj^))  time  on  a  hypercube.  In  addition  to  a  single-node  gath- 
ering operation,  we  need  additional  communication  for  transmitting  packets  from  a 
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designated  processor  to  the  next  two  designated  processors.  This  involves  sending  at 

most  M  packets  to  two  processors,  no  more  than  -r^rr  distance  away,  where  D  is 

the  diameter  of  the  network  and  k  is  the  iteration  number  (A;  =  0,l,...,logA^  —  1). 

For  this  we  use  pipelined  transmission  of  packets  between  two  nodes  but  restricted 

to  each  quadrant  (submesh  or  subcube  of  size  ^  of  processors).  For  a  mesh,  it  takes 

0(M  +  ^^)  time  and  for  a  hypercube,  it  takes  0(M  +  -^-)  time.  Thus  the  total 
21—7—1  2'~i~' 

time  complexity  of  the  modified  concurrent  algorithm  is  as  follows. 

Mesh  :  0(m  +  N  +  E'k=o~' +  ^)  +  ^ log 

=  0{M\ogN  +  N  +  ^/N) 

=  0{M\ogN  +  N) 

Hypercube  :  O  (min(M  +  N,  M^^.j^^)  +  Z'^io'^M  +  ^)  +  Mlog  iv) 

=  0  (min(M  +  N,  M^a.  ^)  +  M\ogN  +  log  N) 

=  o{M\ogN  +  mm{M  +  N,Mma.i^)) 

The  above  modifications  do  not  improve  the  time  complexity  of  our  load  balancing 
algorithm  because  single- node  gather  takes  the  same  time  as  multinode  broadcast. 
But,  in  a  single- node  gathering  operation,  the  packets  destined  for  the  same  processor 
can  be  combined  into  a  larger  packet  at  intermediate  nodes.  If  the  time  for  trans- 
mitting a  larger  packet  is  the  same  as  the  time  for  transmitting  a  smaller  packet, 
single-node  gather  takes  only  0{Mmax  +  VN)  time  for  a  mesh  and  {Mmax  +  log  N) 
time  for  a  hypercube  [10].  The  additional  communication  for  transmitting  packets 
from  a  designated  processor  to  the  next  two  designated  processors  in  each  quadrant 
takes  0{M^ax  +  ^^^)  on  a  mesh  and  0{Mmax  +  ^^^)  on  a  hypercube  (after  com- 
bining all  packets  first)  for  the  A;-th  iteration.  Thus  if  combining  packets  is  allowed, 
the  total  time  complexity  of  the  load  balancing  algorithm  is  0{M  log  N  -\-  y/N)  for  a 
mesh  and  0{M  log  N  +  log  N)  =  0{M  log  TV)  for  a  hypercube. 
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When  there  are  a  small  number  of  packets  (for  example  when  M  =  o(j^)  for 
a  mesh,  or  M  =  o{^^)  for  a  hypercube),  combining  packets  and  using  the  second 
approach  is  advantageous  compared  to  the  first  approach  especially  when  only  on-line 
routing  algorithms  are  used.  Also  the  second  approach  in  general  will  avoid  too  much 
packet  transmission  on  the  average,  and  also  it  avoids  off-line  routing  algorithms  for 
hypercube  if  combining  is  allowed. 

3.5    Parallel  Approach 

The  concurrent  implementation  of  the  load  balancing  algorithm  discussed  in  the 
previous  section  can  be  efficient  when  the  number  of  grids  M  is  small.  But  when  M 
is  very  large,  the  complexity  of  selection  of  grids  in  each  iteration,  namely  0(M), 
makes  the  cost  of  load  balancing  very  high.  For  this  reason,  we  propose  an  alterna- 
tive implementation  in  this  section  in  which  there  is  no  redundancy  in  computation 
and  the  task  of  determining  the  grid  allocations  is  performed  cooperatively  by  the 
processors. 

As  in  the  previous  approach,  the  processors  initially  generate  tokens  or  packets 
one  for  each  grid  it  has.  These  packets  contain  essentially  the  same  information  as 
before.  First,  the  packets  are  evenly  distributed  among  the  processors.  This  token 
distribution  step  is  important  for  distributing  among  the  processors  the  workload 
associated  with  the  remaining  steps  of  the  load  balancing  algorithm.  We  perform 
logA'^  iterations  as  before  where  during  the  i-th  iteration  (0  <  t  <  log  TV  —  1),  load 
balancing  is  performed  between  two  halves  of  a  subcube  of  dimension  log  TV  —  i;  we 
call  each  such  half  a  "segment".  In  each  iteration,  the  following  major  steps  are 
performed  :  (a)  sort  the  packets  such  that  all  packets  whose  new  processor  locations 
belong  to  the  same  segment  reside  in  a  set  of  contiguous  processors;  within  the  same 
segment  they  are  in  the  order  in  which  the  corresponding  grids  will  be  considered 
for  migration,  (b)  selection  of  grids  (i.e.  packets)  for  migration.  At  the  end  of  the 
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algorithm,  a  processor  that  holds  a  packet  informs  the  processor  in  which  the  grid 
WcLS  generated  about  its  new  location.  The  actual  migration  of  grids  happens  only 
after  the  load  balancing  algorithm  is  completed. 

Below  we  give  a  formal  description  of  our  algorithm.  In  the  description,  we  use 
parallel  algorithms  for  well-known  computational  problems  such  as  segmented  scan 
(prefix),  summing,  sorting  etc.  and  packet-routing  problems  such  as  token  distribu- 
tion, broadcasting,  packing,  monotone  routing  etc..  We  will  describe  these  problems 
but  will  refer  to  the  literature  for  efficient  algorithms  for  some  of  these  problems. 

Algorithm  Load  balance; 

(a)  Each  processor  creates  for  each  grid  a  it  has,  a  packet  (token)  with  the  following 

information  :  (a)  Ma,  size  of  the  grid,  (b)  ta,  maximum  distance  it  can  migrate 
from  its  currently  assigned  location,  (iii)  /(a),  index  of  processor  where  it  Weis 
generated  and  (iv)  Qa,  index  of  the  segment  containing  the  processor  the  grid 
is  currently  assigned  to;  this  is  initially  set  to  the  most  significant  bit  of  f(a). 

(b)  Distribute  the  packets  evenly  among  the  processors  using  procedure  Tok- 
enDistribution. 

(c)  Perform  logiV  iterations  where  in  the  /-th  iteration  (0  <  /  <  logA'^  -  1),  call 

procedure  AdjustDistribution(/). 

(d)  Now  each  processor  may  contain  packets  whose  Qa  fields  (that  indicate  final 
locations  of  the  grids)  are  different  from  their  /(a)  fields  (source  locations); 
we  "mark"  such  packets.  Now  these  marked  packets  need  to  be  sent  to  the 
processors  where  the  corresponding  grids  were  generated.  This  is  a  many- 
to-many  routing  problem  for  which  efficient  solutions  exist  when  two  packets 
destined  for  the  same  processor  are  allowed  to  be  combined.  We  use  either 
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procedure  InformSourcel  or  procedure  InformSource2  depending  on 
whether  the  packets  are  allowed  to  be  combined  or  not. 

end  Load  balance; 
Procedure  TokenDistribution; 

1.  Each  processor  Pi  numbers  its  packets  from  1  to  m,  where  m,  is  the  number  of 
packets  it  heis. 

2.  Determine  Mmax  =  max,  m,  using  the  parallel  procedure 
FindMax({m.}ilo\M„.ax,iV  -  l). 

3.  Call  procedure  Broadcast(Mmai,  ^  —  !)• 

4.  Let  k  =  0. 

5.  For  j  =  1  to  Mmax  do  steps  6-10. 

6.  Call  procedure  Broadcast(/:,  iV  —  1). 

7.  Each  Pi  sets  a,-  =  1  if  it  has  a  packet  indexed  j,  else  sets  cq  =  0. 

8.  Call  parallel  procedure  Scan({a,}j^o^,  {6,}ilo^))       are  returned. 

9.  Each  Pi  sends  its  j-th  packet  (if  it  exists)  to  the  processor 
indexed  {k  +  6,-  —  1)  mod  N  using  procedure  Pack. 

10.  Pn-1  sets  k  =  {Bn-i  +  k)  mod  N. 
end  TokenDistribution; 

Procedure  AdjustDistribution(/); 

1.  Sort  packets  according  to  lexicographical  ordev  of  the  fields  <  Qa,ta,Ma  > 
where  Qa  and  Ma  are  in  increasing  order  and  ta  is  in  decreasing  order. 

For  packet  with  equeil  values  of  Qa,ta  and  Ma,  their  relative  order  after  packing 
can  be  arbitrary. 

For  this,  we  use  a  parallel  sorting  scheme  for  hypercubes. 

2.  Each  processor  numbers  its  packets  from  1  up  to  [^] . 
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Let  PSj  be  the  set  of  contiguous  processors  that  have  packets  a  with  Qa  =  j, 
{0  <  j  <2'  —  1).  Assume  these  sets  are  disjoint  (see  Remark  below). 

3.  Concurrently  execute  parallel  procedure  ComputeLoadDifFerence  (j) 
forall;  =  0,1,...,2'-1. 

Now  each  processor  in  Pj  knows  Aj,  half  the  load  difference  between  two  halves 
of  its  segment  namely  Pj  and  Pj+(_i)> ;  let  gj  =  1  if  the  load  in  PSj  is  greater 
than  the  load  in  PSj^^iy  and  Aj  >  c,  0  otherwise. 

Let  PSj  C  PSj  be  the  set  of  contiguous  processors  that  have  packets  a  with 
Qa  =  j  and     =  k,  where  0  <  ;  <  2'  -  1  and  1  <  ifc  <  log  iV. 
Again  assume  these  sets  are  disjoint  (see  Remark  below). 

4.  Execute  parallel  procedure  SegmentedScan  on  Ma's  to  obtain  M^'', 
for  all  0  <  J  <  2'  -  1  and  1  <  A:  <  log  A^.  Here  M^''  is  the  sum  of  Mi's  of 
packets  b  with  tb  =  k  and  which  exist  in  those  processors  of 

PSj  indexed  smaller  than  the  processor  containing  packet  a  and  of  those 
packets  indexed  no  greater  than  a  in  the  same  processor  containing  a. 

5.  Concurrently  execute  parallel  procedure  SelectPackets(j)  for  0  <  j  <  2'  —  1 
such  that  gj  =  1.  After  this  step,  certain  packets  are  "marked"  to  be  migrated 
to  the  other  halves  of  their  segment. 

6.  Each  processor  does  the  following  for  each  marked  packet  a  it  has: 
complement  the  /-th  most  significant  bit  of  Qa  and  decrease     by  1. 

7.  Ifl  ^  log  A'^  then  each  processor  updates  the  segment  number  of  each  packet  a 
that  it  has,  by  concatenating  the  /+  1-th  most  significant  bit  of  /(a)  to  the 
right  of  Qa. 

end  AdjustDistribution; 

Remark  :  We  have  assumed  that  the  sets  of  processors  PSj  (and  hence  the  sets 
PS{j))  are  disjoint.  If  they  are  not,  then  note  that  a  processor  P,-  may  need  to 
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participate  in  at  most  2  parallel  procedures,  one  corresponding  to  the  set  of  processors 
to  the  left  of  P,  and  another  to  the  right  of  P,.  By  dividing  the  execution  cycle  of 
each  processor  into  2  phases,  one  for  the  left  and  another  for  the  right,  we  only  incur 
a  constant  factor  slow  down. 

Procedure  ComputeLoadDifFerence(_y); 

/ /  Processors  in  PSj  participate  in  this  procedure  / / 

1.  Call  parallel  procedure  Sum({Ma}Q„=j,  Zj,  tj),  where  ij  is  the  smallest 
index  of  a  processor  in  PSj. 

2.  For  even  j.  Pi-  broadcasts  Lj  to  all  the  processors  in  PSj  U  PSj^.^^iy  as  follows  : 
Let  a,  =  Li  for  i  =  ij,  0  otherwise. 

Call  pajallel  procedure  SegmentedScan  on  a,'s;  note  that  this  step  must  be 
synchronized  among  all  j's. 

3.  For  odd     do  the  same  as  in  step  2. 

4.  Each  processor  in  PSj  now  knows  Lj  and  Lj^^^iy  and  computes  Aj  and  gj. 
end  ComputeLoadDifference; 

Procedure  SelectPackets(j); 

/ /  Processors  in  PSj  participate  in  this  procedure  / / 

1.  For  Jb  =  1  to  logiV  do  steps  2-3. 

2.  Esich  processor  in  PSj  "marks"  a  packet  a  if  M^*  <  Aj  +  e. 

3.  Identify  the  processor  in  PSj  that  holds  the  marked  packet  with  maximum 
value  of  Ma  (say  Vj);  this  is  done  by  checking  if  the  right  adjacent 
processor  in  PSj  has  a  marked  packet.  This  processor  updates  Aj  to  Aj  —  Vj 
and  broadcasts  it  to  all  the  processors  in  PSj  using  an  operation 
identical  to  step  2  of  ComputeLoadDifFerence. 

end  SelectPackets; 
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Procedure  InformSourcel; 

1.  Sort  the  marked  packets  according  to  their  f{a)  fields  using  a  parallel  sorting 
scheme  for  hypercubes.  For  0  <;<  AT  -  1,  let  T,  be  the  set  of  contiguous  pro- 
cessors with  /(a)  =j.  We  assume  Tj's  to  be  disjoint  without  loss  of  generality 
by  the  previous  remark. 

2.  Call  parallel  procedure  SegmentedScan  on  these  packets  where  the  associa- 
tive operation  is  defined  to  be  merging  of  two  packets  with  the  same  destination 
into  one  packet.  After  this  step,  the  least  indexed  processor  in  Tj  keeps  the 
merged  packet  while  others  in  Tj  discard  their  outputs.  Now  we  have  at  most 
one  packet  destined  for  a  processor  but  a  processor  may  have  more  than  one 
such  packet  to  send.  Each  processor  P,  that  has  at  least  one  such  packet, 
indexes  the  packets  (in  the  order  of  their  destinations)  from  1  to  m|  where 

1  < <  f . 

3.  for  j  =  1  to  ^  do  the  following  :  Each  processor  routes  its  j-th  packet  (if  it 
exists)  to  its  destination  using  procedure  MonotoneRoute. 

end  InformSourcel; 
Procedure  InformSource2; 

1.  Sort  the  maxked  packets  as  in  step  1  of  InformSourcel.  Each  processor  in- 
sets a,-  to  index  of  first  marked  packet  it  has;  a,-  =  0  if  P,-  has  no  marked  packets 
to  send. 

2.  While  there  is  a  marked  packet  to  send  do  the  following  step. 

3.  Each  processor  Pi  with  a,-  ^  0  and  if  the  destination  of  a,-  (i.e.  /(a,))  is  not  the 
same  as  that  of  a,_i  sends  its  a,-th  packet  to  its  destination  and  then  updates  a,- 
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to  the  index  of  next  market  packet  or  0  if  it  has  no  more  such  packets.  Routing 
is  done  using  procedure  MonotoneRoute. 

end  InformSource2; 

Procedure  FindMax  or  Sum({a,}j^o\ ^> 

//  Given  the  a,-  values  with  approximately  ^  values  per  processor,  find  the  maximum 
or  sum  (6)  of  these  values  as  the  case  may  be  and  store  it  in  the  processor  Pk]  takes 
0(f +  logiV)  time.  // 

Procedure  SegmentedScan  or  Scan  [26]; 

//  Compute  prefix  or  suffix  sums  on  contiguous  segments  of  data.  For  example  for 
inputs  {2, 3},  {1, 7, 2},  {1,3, 6),  the  prefix  sums  are  {2, 5},  {1,8, 10},  {1,4, 10}  and  the 
suffix  sums  are  {5, 3},  {10, 9, 2},  {10, 9, 6}.  Addition  can  be  replaced  by  any  associa- 
tive operator  in  this  procedure.  Scan  operation  is  just  a  special  case  of  segmented 
scan.  Given  M  such  segmented  inputs  with  approximately  ^  inputs  per  processor, 
this  procedure  works  in         +  log  A^)  time.  // 

Procedure  Broadcast(j,  V); 

II  Processor  Pj  broadcasts  value  V  to  all  the  N  processors;  takes  O(logiV)  time  in 
a  hypercube  [26].  // 

Procedure  Pack; 

//  Here  a  subset  of  processors  Pjo,Pji  ,...,Pj,  with  jo  <  ji  <  •  •  ■  <  ji  has  one  packet 
each  and  it  is  desired  to  store  the  packet  of  Pj^  in  Pk  foi  t  <  k  <  {t  +  I)  mod  m  for 
some  0  <  t  <  N  —  1.  Using  greedy  routing,  this  operation  can  be  done  in  O(logiV) 
time  [26].  // 

Procedure  MonotoneRoute; 

/ /  Here  a  subset  of  processors  needs  to  send  a  packet  each  and  the  destinations  of 
the  packets  satisfy  the  following  property  :  for  two  processors  P,-  and  Pj  in  the  subset 
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with  i  <  j,  the  destination  of  P.'s  packet  is  indexed  smaller  than  the  destination  of 
Pj's  packet.  This  problem  can  be  solved  in  0(log  N)  time  on  a  hypercube  [26].  // 

Time  complexity  analysis  : 

Procedure  TokenDistribution  takes  0{Mmax  log  N)  time  derived  as  follows  :  Step 

1  takes  0{Mmax)  time  and  steps  2  and  3  take  O(log  N)  time;  steps  6,  8  and  9  all  take 
O(log  N)  time  for  each  iteration  of  step  5. 

Procedure  ComputeLoadDifFerence  takes  C>(^+log  N)  time  as  the  operations 
performed  in  it  are  summing  and  segmented  scan.  By  the  same  reasoning,  procedure 
SelectPackets  also  takes         +  log//)  time. 

Now  the  complexity  of  procedure  AdjustDistribution  is  analyzed  as  follows 
:  step  1  takes  Ts{M,  N)  time  which  is  the  time  to  sort  M  items  on  N  processors 
with  approximately  equal  number  of  items  (in  fact  O(^)  items)  per  processor,  step 

2  takes  O(^)  time  and  each  of  the  steps  3,  4  and  5  takes  +  logN)  time; 
steps  6  and  7  take  O(^)  time.  Hence  the  complexity  of  AdjustDistribution  is 
0{Ts{M,  N)  +  ^  +  log  N)  time.  When  M  <  N,v/e  can  use  odd-even  merge  sort  [26] 
to  sort  the  packets  in  0(log^7V)  time.  When  M  >  N,  M  packets  can  be  sorted  in 
Q^MlogM^  time  on  a  hypercube  using  the  algorithm  given  in  [26]. 

The  complexity  of  procedure  InformSourcel  is  derived  next.  Step  1  takes 
0{Ts{M\  N))  time  where  M'  is  the  number  of  marked  packets;  note  that  the  number 
of  marked  packets  per  processor  is  at  most  ^.  Hence  Ts{M'  ,N)  =  0{\og^N)  when 
M  <  iV  and  equal  to  0(^'^«^)  otherwise.  Step  2  takes         +  log  N)  time  and  step 

3  takes  log  A^)  time.  Hence  the  complexity  of  InformSourcel  is  0{Ts{M,  N)  + 
^logiV).  As  for  procedure  InformSource2,  step  1  takes  the  same  time  as  for  In- 
formSourcel. The  number  of  iterations  of  step  3  is  at  most  Mmax  +  ^  =  0{Mmax)- 
Hence  the  complexity  of  the  procedure  is  0{Ts{M,  N)  +  M^ax  log  N). 
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Now  we  are  ready  to  analyze  the  complexity  of  our  load  balancing  algorithm. 
Step  (a)  takes  0{Mmax)  time  and  step  (b)  takes  0(M„ax  log  N)  time.  Step  (c)  takes 
0{Ts{M,  N)  +  ^  +  log  N)  time  for  each  of  log  N  iterations.  Thus  when  M  <  A'^,  the 
algorithm  has  complexity  0{Mmax  log  N  +  log^TV  +  f  log  N)  when  combining  packets 
is  allowed  and  0{Mmax^ogN  +  log^TV  +  M„»ax  log  iV)  in  the  absence  of  combining. 
When  M>N,  the  algorithm  has  complexity  0{Mmax  log  iV+ log  iV+ f  log  N) 
when  combining  packets  is  allowed  and  0{M^ax  log  N  +  ^^^^  log  N  +  Umax  log  N) 
in  the  absence  of  combining. 

The  asymptotic  complexity  of  the  algorithm  is  seen  to  be  the  same  in  the  presence 
and  absence  of  combining  packets  in  step  (d).  This  is  due  to  the  fact  that  the  token 
distribution  step  (step  (b))  dominates  in  complexity  over  step  (d).  If  step  (b)  can  be 
done  faster  than  0{Mmax  log  A'^)  (which  we  believe  so)  then  combining  packets  may 
reduce  the  complexity  of  the  algorithm. 

3.5.1    Mesh  Network 

Our  parallel  load  balancing  algorithm  can  be  implemented  on  a  mesh  with  only 
a  few  changes.  As  was  done  in  the  concurrent  approach,  the  workload  is  balanced 
between  left  and  right  halves  of  processors,  then  between  upper  and  lower  quadrjints 
of  processors  and  so  on.  We  need  to  go  through  log  A'^  steps  as  in  a  hypercube.  Other 
parts  of  the  algorithm  remain  unchanged.  For  the  analysis  of  time  complexity  of  our 
load  balancing  algorithm  on  a  mesh,  we  first  consider  parallel  algorithms  for  well- 
known  computational  and  routing  problems  used  in  our  load  balancing  algorithm 
cind  their  time  complexities  on  a  mesh.  The  number  of  processors  is  assumed  to  be 
N.  Also  we  assume  that  the  processors  in  the  mesh  are  ordered  in  a  snake-like  row 
major  order. 

Time  complexity  analysis  for  a  mesh  : 
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Broadcasting  a  value  to  all  the  N  processors  by  a  processor  takes  0{y/N)  time. 
Segmented  scan  of  M  segmented  inputs  with  approximately  ^  inputs  per  processor 
takes  0{^  +  y/N)  time  [26].  Finding  the  maximum  or  sum  of  M  values  and  storing 
it  in  some  processor  also  takes  +  y/N)  time  when  there  are  approximately  ^ 
inputs  per  processor.  Packing  and  monotone  routing  can  be  done  in  0{VN)  time 
using  greedy  routing  algorithm  [26]  since  there  is  at  most  one  packet  destined  for 
each  node.  Sorting  N  numbers  (with  one  in  each  processor)  takes  0{y/N  log  N)  time 
using  shearsort  [26].  We  need  to  modify  shearsort  as  follows.  First,  each  processor 
sorts  ^  numbers  sequentially.  This  takes  0(^log  ^)  time.  Then,  instead  of  a  step 
that  compeires  and  exchanges  two  numbers  we  use  a  step  that  merges  two  sorted 
sequences  and  splits  it  into  two  subsequences  of  equal  size.  Assuming  that  each 
processor  has  approximately  ^  numbers,  total  time  complexity  for  modified  shearsort 
is        log  ^  +  ^VN\og  TV).  This  is  equal  to  log  N).  The  detailed  descriptions 

of  the  above  parallel  algorithms  can  be  found  in  [26].  Now  we  analyze  the  time 
complexity  of  our  load  balancing  algorithm  as  follows. 

Procedure  TokenDistribution  takes  0{Mmax\/N)  time  derived  the  same  way  as 
for  a  hypercube.  Procedure  ComputeLoadDifFerence  and  SelectPackets  takes 
+  VN)  time.  The  complexity  of  procedure  AdjustDistribution  is  analyzed 
as  follows  :  step  1  takes  Ts{M,  N)  time  which  is  the  time  to  sort  M  items  on  N 
processors  with  approximately  equal  number  of  items  per  processor.  Sorting  on  a 
mesh  takes  0{:^\ogN)  time.  Step  2  takes  O(^)  time  and  each  of  the  steps  3,  4 
and  5  tcikes  +  y/N)  time;  steps  6  and  7  take  0{^)  time.  Hence  the  complexity 
of  AdjustDistribution  is  0{Ts{M,N)  +  %  +  y/N)  time. 

The  complexity  of  procedure  InformSourcel  is  derived  next.  Step  1  taJces 
0{Ts{M' ,  N))  time  where  M'  is  the  number  of  marked  packets;  note  that  the  num- 
ber of  marked  packets  per  processor  is  at  most  ^.  Ts{M' ,N)  =  0(-^\ogN).  Step 
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2  takes  0(f  +  y/N)  time  and  step  3  takes  O(^)  time.  Hence  the  complexity  of 
InformSourcel  is  0{Ts{M,N)  +  ^).  As  for  procedure  InformSource2,  step 
1  takes  the  same  time  as  for  InformSourcel.  The  number  of  iterations  of  step 

3  is  at  most  M^ax  +  f  =  0{Mmax)-  Hence  the  complexity  of  the  procedure  is 
0{Ts{M,N)  +  Mma.y/N). 

Now  we  are  ready  to  analyze  the  complexity  of  our  load  balancing  algorithm  for 
a  mesh.  Step  (a)  takes  O(M^ax)  time  and  step  (b)  takes  0(M„ax>/iV)  time.  Step 
(c)  takes  0{Ts{M,N)  +  f  +  VN)  time  for  each  of  log  iterations.  The  algorithm 
has  complexity  0{MmaxVN  +  ^^og^N  +  when  combining  packets  is  allowed 
and  0(M„,ax//V  +  ^log^iV  +  MmaxVN)  in  the  absence  of  combining. 

As  in  the  case  of  hypercube,  the  asymptotic  complexity  of  the  algorithm  is  seen 
to  be  the  same  in  the  presence  and  absence  of  combining  packets  in  step  (d),  because 
the  token  distribution  step  (step  (b))  dominates  in  complexity  over  step  (d). 

3.5.2    An  Example 

Here,  we  illustrate  through  an  example  how  our  parallel  load  balancing  algorithm 
described  in  the  previous  section  works.  The  packet  for  grid  a  is  represented  as 
<  Qa,ta,Ma,f{a)  >  where  Qa  and  /(a)  are  binary  numbers.  We  assume  that  there 
are  only  4  processors  for  simplicity.  The  initial  distribution  of  packets  in  our  example 
is  as  follows. 


So 


<  0,  2,1,00  > 

<  0,3,2,00  > 

<  0,0,1,00  > 

Po 


Pi 


<  1,0,4,10  > 

P2 


Si 

<  1,2,3,11  > 

<  1,3,2,11  > 

<  1,2,3,11  > 

<  1,0,3,11  > 

Pz 
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In  step  (b),  we  distribute  the  packets  evenly  among  the  processors  using  procedure 
TokenDistribution.  After  this  step  the  distribution  of  packets  becomes  as  follows. 

<  0,2,1,00  >      <  1,0,4,10  >       I    <  1,2,3,11  >      <  0,3,2,00  > 

<  1,3,2,11  >      <0,0,1,00>       I    <  1,2,3,11  >      <  1,0,3,11  > 

Po  Pi  I  P2  P3 

Now,  we  go  through  log  N  iterations,  where  we  call  procedure  AdjustDistribution 
in  each  iteration.  In  this  procedure,  packets  are  first  sorted  according  to  lexicograph- 
ical oidei  of  the  fields  <  Qa,ta,Ma  >  where  Qa  and  Ma  are  in  increasing  order  and 
ta  is  in  decreasing  order. 

So  I  Si 

<  0,3,2,00  >      <  0,0,1,00  >       I    <  1,2,3,11  >      <  1,0,3,11  > 

<  0,2,1,00  >      <  1,3,2,11  >       I    <  1,2,3,11  >      <  1,0,4,10  > 

Po  Pi  I  P2  ft 

The  procedure  ComputeLoadDifFerence  is  called  to  compute  A,-  which  is  the  half 
of  the  load  difference  between  segment  i  and  i  +  I.  In  our  example,  Lq  =  i  and 
Li  =  15.  Thus  Ao  =  =  5.5.  Some  of  the  grids  in  segment  1  have  to  migrate 

to  segment  0  to  balance  the  workload.  To  select  the  grids  to  migrate,  procedure 
SegmentedScan  is  called  on  M^s  of  packets  with  Qa  =  1.  The  resulting  M^^s  are 
shown  in  parenthesis  beside  the  packets. 

So  I  Sr 

<  0,3,2,00  >      <  0,0,1,00  >  I  <  1,2,3,11  >(3)  <  1,0,3,11  > 

<  0,2,1,00  >      <  1,3,2,11  >(2)  I  <  1,2,3,11  >(6)  <  1,0,4,10  > 

Po  Pi  I  Pi  P3 
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Now,  we  call  procedure  SelectPackets  to  select  and  mark  the  packets  which  need  to 
migrate  to  the  other  segment.  In  procedure  SelectPackets,  Mj^s  are  compared  to 
Ao  in  decreasing  order  of  it.  For  each  k,  packets  a  with  M]'=  <  Ao  are  marked  and  Aq 
is  updated  by  subtracting  Ma  values  of  those  packets  from  Aq.  For  example,  in  the 
first  iteration,  packet  <  1,3,2, 11  >  is  marked  and  Ao  is  updated  to  3.5  by  processor 
Pi.  In  the  next  iteration,  packet  <  1,2,3,11  >  is  marked  and  Aq  is  updated  to  0.5 
by  processor  Pj.  If  we  assume  that  e  =  0  then  no  more  packets  are  marked. 

So  I  5-1 

<  0,3,2,00  >        <  0,0,1,00  >       I    V<  1,2,3,11  >(3)  <  1,0,3,11  > 

<  0,2,1,00  >      y<  1,3,2,11  >(2)    I      <  1,2,3,11  >(6)  <  1,0,4,10  > 

Po  Pi  I  P2  P3 

We  update  Qa  field  of  each  marked  packet  by  complementing  the  least  significant  bit 
of  Qa-  We  also  decrement  ta  of  each  marked  packet  by  one.  Since  this  is  not  the  last 
iteration,  each  processor  updates  the  segment  number  of  each  packet  a  that  it  has, 
by  concatenating  the  most  significant  bit  of  f{a)  to  the  right  of  Qa- 

So  \  Si  \  S2  \  S3 

<  00,3,2,00  >      I  <  00,0,1,00  >      |<01,1,3,11>  |<11,0,3,11> 

<  00,2,1,00  >      |<01,2,2,11>      |<11,2,3,11>  |<10,0,4,1Q> 

Po  \  Pi  \  P2  \  Pz 

Now,  there  are  four  segments,  and  procedure  AdjustDistribution  is  called  again 
after  sorting  the  packets  eiccording  to  <  Qa^ta,  Ma  >  to  balance  the  workload  between 
two  segments.  Ecich  segment  consists  of  just  two  processors.  In  this  iteration,  no 
packets  are  marked,  and  the  packets  contain  the  following  information. 

<  00,3,2,00  >      |<  00,0,1,00  >      |<01,1,3,11>  |<11,2,3,11> 
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<  00,2,1,00  >      |<01,2,2,11>      |<10,0,4,10>  1<11,0,3,11> 

Po  \  Pi  \  P2  \  ^3 

Now,  Qa  indicates  the  destination  of  grid  a.  We  mark  the  packets  <  01, 2, 2, 11  >  and 

<  01, 1, 3, 11  >  since  their  Qa  fields  are  different  from  their  f{a)  fields.  These  packets 
are  sent  back  to  source  processors  by  calling  either  procedure  InformSourcel  or 
procedure  InformSource2  depending  upon  whether  the  packets  are  allowed  to  be 
combined  or  not.  Here,  processor  P3  receives  both  packets  and  migrates  the  cor- 
responding grids  to  processor  Pi.  After  this  is  completed,  grids  migrate  to  their 
destination  processors,  and  computation  starts. 

3.6    Experimental  Results 

We  performed  experiments  to  test  the  accuracy  of  solutions  provided  by  our  load 
balancing  algorithm.  In  these  experiments,  we  assumed  that  the  processors  initially 
have  certain  number  of  coarse  grids  and  the  number  of  child  (fine)  grids  created  in 
each  coarse  grid  region  has  an  exponential  distribution  with  a  mean  value  of  1.  We 
cissumed  that  the  sizes  of  these  fine  grids  which  are  a  measure  of  their  computational 
requirements,  are  independently  and  identically  distributed  random  variables  with 
a  uniform  distribution  in  the  range  [MINSIZE,MAXSIZE].  The  communication  re- 
quirement between  a  parent  grid  and  a  child  grid  is  assumed  to  be  proportional  to 
the  size  of  the  grid  with  the  ptirameter  COMMCOST  being  the  constant  of  propor- 
tionality. 

In  our  experiments,  we  went  through  200  generations,  and  accumulated  the  ob- 
jective function  values  which  were  computed  using  the  following  formula, 

Ti  =  max  I  J2  m&x(d(f{a),g{a))COMMCOST  •  M^) 

where  all  the  notations  were  defined  previously.  We  absorb  Ucomp  and  Ucomm  in  the 
parameter  COMMCOST.  Since  optimal  solutions  are  difficult  to  compute,  we  used 
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the  following  three  yardsticks  to  compare  our  algorithm  against  :  (a)  average  amount 
of  computation  per  processor  which  is  a  trivial  lower  bound  (referred  to  as  LB)  for 
any  solution,  (b)  solution  value  for  decreasing-fit  heuristic  and  (c)  solution  value  for 
list  scheduling  heuristic.  The  last  two  heuristics  are  used  in  multiprocessor  scheduling 
algorithms  [16].  In  decreasing-fit  heuristic,  the  grids  are  sorted  in  decreasing  order 
of  their  sizes  and  they  are  assigned  dynamically  to  idle  processors  in  that  order; 
communication  cost  is  ignored  here.  The  worst  case  behavior  of  this  scheduling 
algorithm  is  analyzed  in  [16]  as  follows. 

Wo  -  3  3N 

where  a;  and  Uo  are  the  maximum  execution  time  obtained  by  this  algorithm  and 
optimal  scheduling  respectively,  and  N  is  the  number  of  processors. 

In  the  list  scheduling  heuristic,  grids  are  arranged  in  decreasing  order  of  their  sizes 
but  with  communication  constraint  tags  (t'^s)  attached  as  in  our  heuristic.  Grids  are 
assigned  from  the  list  in  that  order,  but  a  grid  can  be  assigned  to  a  free  processor 
only  when  the  distance  of  the  free  processor  from  the  present  location  of  the  grid  does 
not  exceed  the  value  for  that  grid.  After  a  grid  is  assigned,  it  is  removed  from  the 
list.  To  find  the  next  grid  to  assign,  we  scan  the  list  again  from  its  beginning.  If  we 
cannot  find  any  grid  for  a  free  processor,  that  processor  is  kept  idle.  The  decreasing- 
fit  algorithm  has  0{M  log  M )  sequential  time  complexity  while  the  list  scheduling 
has  O(M^)  sequential  time  complexity,  where  M  is  the  number  of  grids. 

In  the  rest  of  this  section,  we  use  Li ,  and  L3  to  denote  respectively  our  load 
balancing  heuristic,  list  scheduling  and  decreasing-fit  heuristics.  We  plotted  the  so- 
lution values  obtained  by  the  three  heuristics  Li,  L2,  L3  and  lower  bound  LB  in  the 
graphs.  For  each  heuristic,  we  show  both  computational  and  communication  costs. 
They  correspond  to  the  first  and  second  terms  in  the  objective  function.  The  sum  of 
these  two  (total  cost)  is  also  shown  in  the  graphs. 
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First,  we  analyzed  the  sensitivity  of  the  solution  provided  by  our  heuristic  with 
respect  to  the  limit  for  communication  (namely  Cmax)-  Figure  3.3  shows  the  effect  of 
ta  values  (the  maximum  distance  grid  a  is  allowed  to  migrate)  on  the  solution  values; 
note  that  is  directly  related  to  C^ax-  In  the  figure,  t{avg)  is  the  maximum  distance 
that  a  grid  of  average  size  can  migrate.  The  grids  of  larger  sizes  can  only  migrate 
shorter  distances  while  the  grids  of  smaller  sizes  can  migrate  farther.  The  number  of 
processors  is  set  to  64,  and  MINSIZE  and  MAXSIZE  are  set  to  6  and  14  respectively 
in  this  experiment.  The  initial  number  of  grids  per  processor  is  set  to  5. 

The  t{avg)  value  which  provides  the  smallest  total  cost  tends  to  decrease  as 
COMMCOST  increases.  This  agrees  with  our  intuition  that  reducing  communi- 
cation cost  becomes  more  important  than  load  balancing  when  the  communication 
cost  is  high.  Regardless  of  COMMCOST  values,  the  total  cost  is  very  high  when 
t{avg)  is  only  one.  This  is  because  the  grids  cannot  migrate  fax  enough  to  achieve 
well  balanced  workload.  As  t(avg)  increases,  workload  becomes  better  balanced  at 
the  expense  of  increased  communication  cost.  After  the  execution  time  reaches  the 
minimum,  it  increases  slowly  cis  t{avg)  increases. 

Figure  3.3  shows  that  the  solution  values  are  not  very  sensitive  to  t{avg)  especially 
when  COMMCOST  is  small.  As  an  example,  when  COMMCOST  is  0.1,  the  difference 
in  total  costs  is  less  than  1%  of  the  minimum  total  cost  when  t{avg)  ranges  from  2 
to  8.  This  is  because  our  load  balancing  algorithm  does  not  migrate  grids  unless  it 
helps  balancing  workload  even  though  t{avg)  is  a  large  number.  This  fcict  makes  it 
easy  to  choose  right  t{avg)  for  good  performance. 

Figure  3.4  -  Figure  3.6  shows  the  performance  comparison  among  the  three  algo- 
rithms when  COMMCOST  changes  from  0.01  to  0.5.  The  number  of  processors  and 
the  initial  number  of  grids  per  processor  are  set  to  128  and  5  respectively.  MINSIZE 
and  MAXSIZE  are  set  to  4  and  16.  Figure  3.6  shows  that  Li  and      provide  good 
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Figure  3.3.  Total  Costs  for  Different  t{avg)  and  COMMCOST  values 
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performance  regardless  of  COMMCOST.  L3  provides  good  performance  only  when 
COMMCOST  is  smaller  than  0.1.  But,  since  Z/3  only  balances  the  workload  without 
trying  to  reduce  the  communication  cost,  it  gives  poor  performance  when  COMM- 
COST increases.  When  COMMCOST  is  smaller  than  or  equal  to  0.2,  Li  performs 
better  than  Li.  But,  as  COMMCOST  increases,  Li  outperforms  I3  slightly.  Fig- 
ure 3.4  and  Figure  3.5  show  that  when  COMMCOST  is  very  small,  L2  and  L3  provide 
well  balanced  workload  while  keeping  the  communication  cost  reasonably  low.  In  this 
case  L3  is  better  than  L2  because  of  its  lower  time  complexity.  When  COMMCOST 
is  larger  than  0.1,  Ii  is  the  best  among  the  three  algorithms.  This  result  shows  the 
importance  of  reducing  communication  cost  to  achieve  good  performance. 

Figure  3.7  -  Figure  3.9  show  the  solution  values  of  the  three  algorithms  for  differ- 
ent number  of  processors.  COMMCOST  is  set  to  0.2  and  all  the  other  parameters 
axe  the  same  as  before.  Regardless  of  the  number  of  processors,  L3  balances  the 
workload  best,  but  incurs  worst  communication  cost.  When  we  compare  Li  to  Zj, 
Z/2  balances  workload  slightly  better  than  L\.  But  Li  incurs  smaller  communication 
cost.  Figure  3.8  shows  that  communication  cost  of  L3  increases  more  rapidly  than 
those  of  L\  and  L2  as  the  number  of  processors  increases. 

Figure  3.10  -  Figure  3.12  show  the  solution  values  of  the  three  algorithms  for  dif- 
ferent grid  size  variances.  MINSIZE  and  MAXSIZE  are  set  to  10-VAR  and  lO-j-VAR 
respectively  in  this  experiment.  The  number  of  processors  is  set  to  128  and  all  the 
other  parameters  are  the  same  as  in  Figure  3.7  -  Figure  3.9.  Figure  3.10  shows  that 
large  variance  of  grid  sizes  slightly  affect  the  ability  of  balancing  the  workload  for  L\ 
and  Z/2.  But  L3  can  balance  the  workload  better  when  the  variaxice  of  grid  sizes  is 
large.  As  an  example,  when  VAR  is  10,  the  computation  cost  incurred  by  L3  is  very 
close  to  the  lower  bound  in  Figure  3.12.  The  communication  cost  obtained  by  L3 
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is  getting  larger  when  the  variance  of  grid  sizes  increases.  This  is  because  the  com- 
munication cost  is  proportional  to  grid  size  and  moving  a  large  grid  involves  large 
communication  cost.  L3  is  afFected  the  most  by  the  variance  of  grid  sizes  while  L2  is 
affected  the  least. 

Figure  3.13  -  Figure  3.15  show  the  solution  values  of  the  three  algorithms  for 
different  number  of  grids  per  processor  we  started  with.  All  the  other  parameters 
are  the  same  as  in  Figure  3.10  -  Figure  3.12.  These  figures  also  show  that  Li  and  L2 
provide  good  performance  when  the  number  of  grids  is  large  enough.  But,  when  the 
number  of  grids  is  small,  all  the  three  heuristics  provide  poor  performance  compared 
to  the  lower  bound. 

The  result  of  our  experiments  shows  that  our  load  balancing  cdgorithm  and  list 
scheduling  heuristic  provide  quality  solutions  to  the  problem  which  is  shown  to  be 
intractable.  In  addition,  our  load  balancing  algorithm  can  be  efficiently  parallelized 
on  distributed  memory  multiprocessor  systems.  Decreasing-fit  heuristic  provides  near 
optimal  solutions  when  communication  cost  is  very  low.  It  will  be  very  useful  when 
tasks  are  almost  independent  with  little  or  no  communication  among  them.  List 
scheduling  heuristic  provides  good  quality  solutions  in  most  of  the  cases.  But  O(M^) 
sequential  time  complexity  make  it  expensive  to  run  when  M  (the  number  of  grids) 
is  a  large  number.  Since  it  is  a  greedy  heuristic,  it  cannot  be  easily  peirallelized  on 
distributed  memory  multiprocessor  systems. 
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Figure  3.4.  Computation  Costs  for  Different  COMMCOST  values 
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Figure  3.5.  Communication  Costs  for  Different  COMMCOST  values 
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Figure  3.7.  Computation  Costs  for  Different  Numbers  of  Processors 
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Figure  3.8.  Communication  Costs  for  Different  Numbers  of  Processors 
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Figure  3.9.  Total  Costs  for  Different  Numbers  of  Processors 
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Figure  3.10.  Computation  Costs  for  Different  VAR  values 
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Figure  3.11.  Communication  Costs  for  Different  VAR  values 
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Figure  3.13.  Computation  Costs  for  Different  Numbers  of  Grids  per  Processor 
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Figure  3.14.  Communication  Costs  for  Different  Numbers  of  Grids  per  Processor 
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CHAPTER  4 
DOMAIN  DECOMPOSITION 

In  this  chapter,  we  discuss  the  second  approach  which  uses  domain  decompo- 
sition to  assign  computation  to  processors.  In  this  dissertation,  we  do  not  focus  on 
this  approach  due  to  its  inefficiency,  but  we  discuss  it  only  for  completeness.  Hence, 
we  discuss  it  briefly  without  attempting  to  solve  the  problem.  Here,  we  divide  the 
computational  domain  into  N  (number  of  processors)  regions  and  assign  each  region 
to  a  processor.  The  grids  at  all  levels  within  a  region  are  assigned  to  the  same  pro- 
cessor. If  a  grid  at  any  level  spans  more  than  one  region,  it  is  divided  into  pieces 
and  each  piece  is  assigned  to  a  processor.  Figure  4.1  shows  an  example  in  which 
one-dimensional  grids  are  assigned  to  a  linear  array  of  processors.  In  the  figure, 
the  computational  domain  was  decomposed  such  that  the  computationed  workload  of 
level  1  grids  is  balanced  among  the  processors. 


PO 

PI 

P2 

P3 

Figure  4.1.  An  Assignment  of  One-dimensional  Grids 

Since  a  grid  can  be  divided  into  pieces  and  assigned  to  more  than  one  processor, 
intra-grid  communication  is  necessary  among  different  processors.  But,  by  always 
assigning  a  grid  to  a  set  of  contiguous  processors,  we  can  often  limit  the  intra-grid 
communication  to  neighboring  processors.    Since  the  grids  (or  pieces  of  grids)  at 

76 


77 


all  levels  in  a  region  are  assigned  to  a  processor,  inter-grid  communication  is  done 
within  a  processor  in  this  approach.  When  the  next  level  grids  are  generated,  the 
current  assignment  of  grids  may  not  balance  the  workload  properly.  The  domain 
should  be  redecomposed  according  to  the  new  workload.  Then,  the  grids  at  all  levels 
are  reassigned  according  to  the  new  decomposition  of  the  domain.  This  results  in 
moving  solutions  of  lower  level  grids  from  one  processor  to  another.  Figure  4.2  shows 
a  redecomposition  of  the  domain  after  the  generation  of  a  grid  at  level  2  in  Figure  4.1. 
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P3 

Figure  4.2.  Redecomposition  of  the  Domain  After  a  Level  2  Grid  is  Generated 

Since  each  decomposition  of  the  domain  involves  reassigning  the  solutions  of  all 
the  lower  level  grids,  it  is  desirable  to  perform  decomposition  less  frequently.  As  in 
the  previous  approach,  we  synchronize  the  computation  at  each  level  of  grids.  Each 
time  the  next  level  grids  are  generated,  redecomposition  of  computational  domciin 
is  performed.  When  a  coarse  grid  at  level  0  is  imposed  on  the  entire  domain,  the 
workload  is  uniformly  distributed  in  the  domain.  Then  it  is  easy  to  decompose  the 
domain  into  N  regions  so  that  each  region  has  the  equal  amount  of  workload.  But, 
after  fine  grids  are  imposed  on  many  different  places  in  the  domain,  the  workload 
distribution  is  not  uniform  any  more.  Redecomposing  the  computational  domain  also 
involves  moving  solutions  of  all  the  lower  level  grids.  Therefore,  it  is  not  trivial  to 
find  an  optimal  redecomposition  of  computational  domain  so  as  to  minimize  the  cost 
of  obtaining  solutions  on  the  next  level  grids  and  moving  the  solutions  of  all  the  lower 
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level  grids.  In  general,  the  redecomposition  should  be  done  in  such  a  way  that  after 
assigning  regions  to  processors  the  following  quantity  is  minimized. 

Ti  =  max  luaomj.  E       +         ^  E  '^(P'         I  +  ^ 
"    \        jeSp  }€Sp  / 

where 


Sp         —  set  of  grids  and  subgrids  assigned  to  processor  p;  this  is  defined  by 

the  new  decomposition 
Mj        —  number  of  grid  points  in  grid  or  subgrid  j  6  Sp 
B^^        —  number  of  grid  points  on  the  boundary  of  subgrid  ;  G  Sp  shared  by 

einother  subgrid  assigned  to  processor  q  {jp  ^  q) 
Ucomp     —  computational  cost  per  grid  point 

Ucomm  —  cost  of  transmitting  a  grid  point  value  between  neighboring  processors 
F  —  frequency  of  intra-grid  communication  during  level  i  computation 

d{p,  q)    —  communication  distance  between  processors  p  and  q 
^  —  cost  of  redecomposition  including  moving  solutions  of  all  the  lower 

level  grids 


The  first  term  within  parenthesis  represents  the  computation  cost  to  solve  the 
problem  on  the  grids  assigned  to  processor  p.  The  costs  of  error  estimation,  clus- 
tering and  generating  the  next  level  grids  are  also  included.  Since  a  grid  can  be 
assigned  to  more  than  one  processor,  we  assume  that  the  clustering  algorithm  can  be 
efficiently  parallelized.  The  second  term  represents  the  intra-grid  communication  cost 
to  obtain  the  neighboring  grid  point  values.  Intra-grid  communication  cost  between 
two  subgrids  assigned  to  different  processors  p,  q  is  assumed  to  be  proportional  to  the 
product  of  the  number  of  grid  points  on  the  boundary  shared  by  two  subgrids  and  the 
distance  between  the  two  processors.  ^  indicates  the  cost  of  moving  solutions  of  all 
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the  lower  level  grids  according  to  the  new  decomposition  of  computational  domain. 
As  we  move  grids  defined  at  higher  levels  of  refinement,  ^  becomes  more  expensive, 
depends  on  the  new  decomposition  of  the  domain. 
Finding  the  optimal  partitioning  of  the  domain  which  minimizes  the  above  T,  is 
very  difficult  for  two-dimensional  grids.  If  we  ignore  communication  cost  and  only 
balance  the  computational  workload,  binary  decomposition  [7]  can  be  used  for  parti- 
tioning the  domain.  The  resulting  regions  can  be  easily  assigned  to  two-dimensional 
mesh  or  hypercube  multiprocessors  [7].  Binary  decomposition  is  an  expensive  opera- 
tion because  it  involves  collecting  exact  load  distribution  and  redrawing  boundaries 
between  the  regions.  If  the  resulting  regions  were  assigned  to  hypercube  multipro- 
cessors, the  next  application  of  binary  decomposition  can  be  done  more  efficiently  by 
collecting  the  workload  of  each  region  in  a  tree-like  fashion.  Figure  4.3  shows  how 
this  can  be  done.  Initially,  each  processor  computes  the  new  workload  of  the  region 
assigned  to  it.  Then,  it  sends  the  amount  of  workload  to  its  parent  node.  The  parent 
node  adds  the  two  workloads  received  from  its  child  nodes  and  send  the  result  to  the 
grandparent  node.  This  procedure  continues  until  the  root  of  the  tree  has  the  work- 
load of  both  left  and  right  half  of  the  domain.  In  the  next  phase,  the  root  node  makes 
the  first  vertical  cut  through  the  domain  using  the  above  information.  The  result  of 
this  cut  is  sent  to  the  two  child  nodes  in  the  next  level  of  the  tree.  Then  the  two 
child  nodes  make  horizontal  cuts  through  left  and  right  half  of  the  domain  simultane- 
ously using  the  information  received  from  their  parent  and  two  grandchildren  nodes. 
This  procedure  continues  until  y  processors  make  last  level  cuts  in  parallel.  This 
method  is  similar  to  dimensional  exchange  of  workload  in  hypercube  multiprocessor 
systems  [33]. 

The  advantage  of  the  domain  decomposition  approach  is  that  inter-grid  communi- 
cation can  be  accomplished  without  inter-processor  communication.  This  is  possible 


(a)  Phase  1 


(b)  Phase  2 

Figure  4.3.  Computation  of  Workload  for  Binary  Decomposition 
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through  moving  the  solutions  of  all  the  lower  level  grids  eeich  time  redecomposition 
is  done.  In  the  grid  indivisibility  approach,  the  solutions  of  level  i  grids  are  always 
sent  to  the  processors  where  the  solutions  of  level  i  -  1  grids  are  stored.  But,  in  this 
approach,  the  solutions  of  lower  level  grids  can  be  sent  to  the  processors  where  the 
computation  on  new  grids  is  performed. 

If  we  consider  the  new  grids  and  the  solutions  of  lower  level  grids  as  jobs  and 
resources  respectively,  this  problem  becomes  similar  to  the  load  balancing  problem 
with  resource  migration  formulated  by  Varadarajan  and  Ma  [31].  In  their  load  bal- 
ancing problem,  resources  are  allowed  to  migrate  to  new  nodes  so  that  jobs  can  be 
sent  to  the  nodes  with  resources  cind  processed  within  a  given  desired  response  time. 
But  in  our  problem,  the  cost  of  moving  the  solutions  of  all  lower  level  grids  may  be 
very  expensive,  especially  at  the  highest  level  of  adaptive  mesh  refinement.  When 
the  computation  on  grids  at  level  i  is  performed,  we  only  need  the  solutions  of  grids 
at  level  i  —  1  for  interpolating  boundary  values  and  projecting  new  solutions.  Hence, 
moving  the  solutions  of  all  the  lower  level  grids  causes  unnecessary  inter-processor 
communication.  For  this  reeison,  we  do  not  pursue  any  solutions  to  the  load  balancing 
problem  using  this  approach. 


CHAPTER  5 
GRID  PARTITIONING  APPROACH 

In  this  chapter,  we  discuss  the  third  approach  for  solving  adaptive  mesh 
refinement  problems  using  multiprocessor  systems.  In  contrast  to  the  first  approach 
where  the  number  of  grids  should  be  greater  than  the  number  of  processors  for  efficient 
assignment,  it  is  necessary  to  have  larger  number  of  processors  than  grids  in  this 
approach.  First,  we  will  describe  the  approach  in  detail  and  explain  the  difference 
between  the  first  two  approaches.  We  will  formulate  the  problem  for  two-dimensional 
adaptive  meshes,  and  discuss  how  we  solve  the  load  balancing  problem  using  the  grid 
ptirtitioning  approa<;h. 

5.1    The  Approach 

As  in  the  first  and  second  approaches,  we  synchronize  the  grid  assignment  after 
generating  all  the  grids  at  level  i  in  this  approach.  But,  unlike  the  first  approach,  a 
grid  CcUi  be  divided  into  pieces  (subgrids)  and  assigned  to  more  than  one  processor. 
This  property  is  shared  by  the  second  approach.  However,  in  the  third  approach, 
we  keep  the  solutions  of  coarser  level  grids  in  the  same  processors  where  they  were 
generated  and  we  allocate  a  set  of  processors  for  each  newly  generated  fine  grid. 
Since  a  newly  generated  grid  can  be  assigned  to  a  set  of  processors  different  from 
those  where  it  was  generated,  inter-grid  communication  occurs  between  the  two  sets 
of  processors.  Intra-grid  communication  also  occurs  among  a  set  of  processors  because 
a  single  grid  can  be  assigned  to  more  than  one  processor. 

Figure  5.1  shows  an  assignment  of  one-dimensional  grids  at  different  levels.  The 
lines  in  the  figure  represent  the  grids  at  level  0  or  1.  The  coarse  grid  at  level  0  was 
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Figure  5.1.  An  Assignment  of  One-Dimensional  Grids 

uniformly  divided  and  assigned  to  four  processors.  The  first  fine  grid  at  level  1  was 
assigned  to  a  set  of  processors,  that  is,  Pq,  Pi,P2  and  the  second  one  was  assigned  to 
P3.  One-dimensional  grids  can  be  easily  mapped  onto  a  linear  array  of  processors  so 
that  intra-grid  communication  is  limited  to  neighboring  processors.  Here  the  problem 
is  to  distribute  the  fine  grids  among  the  processors  so  as  to  balance  the  workload  as 
well  as  to  minimize  the  communication  costs  necessary  for  sending  the  interpolated 
boundary  Veilues  and  projecting  the  fine  grid  solutions  back  onto  coarse  grids. 

Figure  5.2  shows  the  grid  structure  after  a  level  2  grid  is  generated  in  Figure  5.1 
and  two  different  assignments  of  that  grid.  In  (a)  the  grid  at  level  2  was  assigned 
to  Pq  and  Pi.  This  assignment  incurs  no  inter-grid  communication  cost  but  high 
computation  cost  because  Pj  and  P3  are  idle  during  the  level  2  computation.  Note 
that  the  computational  workload  cissociated  with  the  fine  grid  at  level  2  may  be 
significant  though  the  domtiin  my  be  small  since  finer  mesh  spjicing  is  used  at  level 
2.  Another  possible  assignment  is  described  in  (b),  where  the  grid  at  level  2  was 
assigned  to  Pq,  Pi,  P2  and  P3.  In  this  assignment,  a  portion  of  grid  generated  by  Pi 
is  processed  by  P2  eind  P3.  This  implies  that  communication  is  necessary  between  Pi 
and  P2  and  Pj  and  P3.  But  the  computation  time  is  reduced  because  a  smaller  piece  of 
the  grid  is  assigned  to  each  processor.  This  problem  becomes  more  complicated  when 
there  are  more  than  one  fine  grid  and  when  a  two-dimensional  domain  is  involved. 
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Figure  5.2.  Two  Different  Assignments  of  a  Level  2  Grid 
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5.2    Problem  Formulation 


In  this  section,  we  consider  the  grid  assignment  for  a  two-dimensional  adaptive 
mesh  algorithm.  When  we  formulate  our  load  balancing  problem,  we  assume  that  all 
two-dimensional  grids  are  rectangular.  We  also  assume  that  target  machines  are  two- 
dimensional  mesh  multiprocessors.  Later  in  this  chapter,  we  will  consider  hypercube 
multiprocessors. 

Let  m  be  the  number  of  grids  at  level  i.  If  we  allocate  Xj  x  Yj  submesh  for  each 
grid  j  close  to  the  processors  where  grid  ;  was  generated,  the  submesh  of  one  grid 
can  overlap  with  submeshes  of  other  grids  especially  when  grids  are  closely  located 
to  each  other.  Then,  the  amount  of  workload  for  the  overlapping  processors  will  be 
twice  as  much  as  those  of  other  processors.  To  avoid  this  workload  imbalance,  some  of 
the  grids  may  have  to  be  assigned  to  processors  which  are  distant  from  the  processors 
where  they  were  generated.  These  eissignments  will  incur  inter-grid  communication 
at  some  distance  which  can  not  be  expressed  in  terms  of  Xj  and  Yj.  To  define  the 
problem  more  formally,  we  introduce  the  following  notations  : 

G  —  iy,  E)  —  processor  graph  where  V  is  the  set  of  processor  nodes  and 


E  the  set  of  communication  links 


S 


set  of  fine  grids  generated  at  level  i 


f{a)  is  the  neighborhood  of  processors  where  grid  a  was  generated, 
processors  (subset  of  /(a))  having  boundary  values  for  grid  a 


/'(«) 


before  load  balancing 


comp 


computational  cost  per  mesh  point;  depends  on  the  finite  difference 


method  used 


comm 


cost  of  transmitting  a  mesh  point  value  between  neighboring 


processors 


communication  distance  between  processors  pi  and  pj  in  G 


86 


d'{Vi,V2)    —  maXpj6Vi,p2eV2  d{pi,p2) 

F  —  frequency  of  intra-grid  communication  during  level  i  computation 

g  :  S  —^2^  —  grid  assignment  function  to  be  determined  by  load  balancing 
g'{a)  —  processors  (subset  of  g{a))  having  boundary  values  for  grid  a 

after  load  balancing 
Sg  —  set  of  subgrids  created  by  assignment  g 

Mh  —  number  of  mesh  points  in  the  subgrid  h 

Bh  —  number  of  mesh  points  on  the  boundary  of  subgrid  b 

h  -.V  ^  Sg  —  h{p)\s  the  subgrid  assigned  to  processor  p 

After  the  partitioning  and  assignment  of  grids  at  level  t,  the  computational  workload 
Ep  for  processor  p  is  given  by  UcompMh(p).  Note  that  we  require  that  no  more  than 
one  subgrid  be  assigned  to  a  processor.  For  a  fine  grid  a,  inter-grid  communica- 
tion cost  Ci{a)  is  estimated  to  be  Ucomm  {d'{f'{a),g'{a))B'{a)  +  d'{f{a),g{a))M'{a)) 
where  B'{a)  =  „^(|(/>f4|,|,,(a)|)  and  M'{a)  =  ^(|(;|;^|,|,(,)|);  the  first  term  is  the  cost 
of  migrating  the  boundary  values  for  fine  grid  a  that  occurs  before  the  computation  of 
mesh  point  values  in  a  and  the  second  term  is  the  cost  of  sending  the  fine  grid  values 
back  to  the  parent  grid  for  updating.  Whenever  a  grid  a  is  allocated  to  more  than  one 
processor,  for  each  of  its  subgrids  6,  intra-grid  communication  is  needed  whose  cost 
6*2(6)  is  estimated  to  be  UcommFBh  where  B\,  is  the  number  of  mesh  points  (interior 
to  grid  a)  on  the  boundary  of  6.  Our  problem  is  then  formally  stated  as  follows  : 

Find  g,  Sg  and  hence  /i  so  as  to  minimize  maxpgv  {Ep  +  C2(/i(p)))+maXag5  Ci(o) 
with  the  restriction  that  the  grids  are  assigned  to  disjoint  sets  of  processors, 
i.e.  for  any  two  grids  oi  and  oj,  g{ai)  D  5^(02)  =  0. 

As  was  stated  before,  we  restrict  our  attention  to  the  case  where  each  fine  grid  aj 
(1  <  J  <  m)  is  rectangular  in  shape  with  dimensions  Wj  x  hj  and  the  processor  graph 
G  is  a  mesh  network  with  P  rows  and  Q  columns  (assuming  P  >  Q  without  loss  of 
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generality).  Since  there  is  at  most  one  subgrid  assigned  to  a  processor,  m  <  PQ.  The 
problem  is  then  posed  as  follows  : 

Determine  for  each  grid  Cj,  dimensions  Xj  x  Yj  of  the  processor  sub- 
mesh  to  be  allocated  to  the  grid  and  /j,  the  index  of  the  processor  in  the 
south-west  corner  of  the  submesh  so  as  to  minimize  maxi <j<m(C4omp^^  + 
2UcommF{^  +  ^))  +  max^j^sCi{aj).  In  the  rest  of  this  chapter,  ^  denotes  when 
we  discuss  the  formulation  of  the  problem. 

For  the  simplicity  of  the  problem,  we  ignore  inter-grid  communication  cost  and 
seek  solutions  which  minimize  computation  cost  and  intra-grid  communication  cost. 
We  will  discuss  the  extension  of  the  approach  to  include  inter-grid  communication 
cost  at  the  end  of  this  chapter.  If  we  remove  the  term  representing  inter-grid  com- 
munication cost  from  the  problem  formulation,  our  objective  function  decreases  as 
Xj  and  Yj  increase.  Consequently,  it  is  advantageous  to  assign  a  submesh  which  is  as 
large  as  possible  for  each  grid  Cj.  It  is  also  necessary  that  be  equal  to  for 
all  pairs  of  j  and  k  to  balance  the  workload  among  the  processors.  From  the  above 
discussion,  we  have  the  following  two  desirable  properties  of  processor  allocation. 

1.  ET=i  XjYj  =  N, 

When  the  total  number  of  grid  points  at  level  i  is  larger  than  the  number  of  processors, 
the  number  of  processors  XjYj  for  each  grid  j  is  limited  by  the  above  two  properties. 
Let  Nj  =  XjYj.  Then  the  intra-grid  communication  cost  for  grid  aj  is  as  follows. 

Then, 
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becomes  0  when  Xj  =  yj^.  Since  =  ^,Yj  =  ^j^.  Then  it  follows 
that  ^  =  ^.  To  minimize  the  cost  of  intra-grid  communication  separately  for  each 
grid  (for  a  fixed  number  of  processors  assigned  to  the  grid),  we  add  the  third  property 
of  processor  allocation  as  follows. 


Now,  our  problem  is  approximated  to  finding  m  submeshes  and  their  locations,  one 
for  each  grid,  which  satisfy  the  above  three  properties.  This  problem  is  formally 
stated  as  follows  : 

For  each  grid  Cj  of  dimension  wj  x  hj,  find  Xj  x  Yj  processor  submesh  to 
be  allocated  to  the  grid  and  Ij,  the  index  of  the  processor  in  the  south-west 
corner  of  the  submesh  so  as  to  maximize        ^j^j  with  the  restriction  that 

Vj,Jk  ^  =  and  V;  ^  =  ^.  We  will  call  this  problem  submesh  allocation 
problem. 

5.3    Processor  Allocation  Using  Two-Dimensional  Packing 

In  this  section,  we  propose  an  efficient  approach  to  find  an  approximate  solution 
to  the  problem  formulated  in  the  previous  section.  This  approach  views  the  problem 
as  a  two-dimensioned  packing  problem  wherein  rectangles  need  to  be  packed  in  a  bin 
of  infinite  width  and  height.  We  present  an  approximate  heuristic  for  this  packing 
problem  and  show  how  the  approximate  solution  can  be  used  to  allocate  processor 
submeshes  to  the  grids.  Experimental  results  on  the  performance  of  the  packing 
algorithm  are  included  at  the  end  of  this  section. 

5.3.1    A  Packing  Algorithm 

Two  dimensional  packing  problem  heis  been  studied  by  many  researchers  [5,  4,  6, 
18].  It  arises  in  a  variety  of  situations  such  as  scheduling  of  tasks  and  cutting-stock 
problems.  Cutting-stock  problems  may  involve  cutting  objects  out  of  a  sheet  or  roll 
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of  material  so  as  to  minimize  waste.  The  scheduling  of  teisks  with  a  shared  resource 
involves  two  dimensions,  the  resource  and  time,  and  the  problem  is  to  schedule  the 
tasks  so  as  to  minimize  the  total  amount  of  time  used.  Memory  is  a  typical  exiimple 
of  shared  resources.  In  general,  the  problem  is  stated  eis  follows  : 

Given  a  rectangular  bin  with  fixed  width  and  infinite  height,  pack  a 
finite  set  of  rectangles  of  specified  dimensions  into  the  bin  in  such  a  way 
that  the  rectangles  do  not  overlap  and  the  total  bin  height  used  in  the 
packing  is  minimized.  The  rectangles  should  be  packed  with  their  sides  parallel 
to  the  sides  of  the  bin.  In  one  version  of  the  problem  (see  [5])  the  rectangles  should 
be  packed  in  a  fixed  orientation. 

In  our  processor  allocation  problem,  grids  correspond  to  rectangles  to  be  packed. 
But,  the  problem  is  slightly  different  from  the  above  two-dimensional  bin  packing 
problem.  First,  the  width  as  well  as  the  height  of  the  bin  is  unlimited.  Assume  that 
we  are  given  a  y/N  x  y/N  mesh  multiprocessors  for  m  grids.  The  goal  of  our  packing 
is  minimizing  the  maximum  of  width  and  height  of  bin  used  for  packing  m  grids 
starting  from  the  left-bottom  comer  of  the  bin.  Second,  the  orientation  of  grids  is 
not  important,  so  grids  can  be  rotated  when  they  are  packed.  Figure  5.3  shows  an 
example  of  such  packing. 

After  packing  m  grids  as  in  Figure  5.3,  we  use  the  two  ratios,  and  to 
allocate  a  submesh  for  each  grid.  The  detailed  allocation  method  will  be  described 
later  in  this  section.  First,  we  will  show  that  decision  version  of  our  2D  packing 
problem  is  NP-complete  in  the  following  theorem. 

Theorem  5.3.1  The  following  problem  is  NP-complete  : 

Given  m  rectangles  of  dimension  Wi  x  hi  (iw,-  >  hi),  is  there  a  packing  so  that  the 
maximum  of  height  and  width  of  packing  is  smaller  than  or  equal  to  L? 


I 
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Figure  5.3.  An  Example  of  Packing 

Proof  :  We  will  reduce  the  bin  packing  problem  which  is  known  to  be  NP-complete 
[24]  to  our  2D  packing  problem.  The  bin  packing  problem  is  stated  as  follows:  Given 
k  bins  of  capacity  J5,  and  a  finite  set  f/  =  {ui,  U2, . . . ,  u„,}  of  items  with  size 
of  each  item  s{ui)  €  Z"*"  (s(u,)  <  5),  is  there  a  partition  of  U  into  disjoint  sets 
Ui,U2,...,Uk  such  that  the  sum  of  sizes  of  the  items  in  each  t/,  is  no  more 
than  B?  We  can  construct  an  instance  of  the  2D  packing  problem  for  an  instance 
of  the  bin  packing  problem  in  the  following  way  :  Create  m  +  1  rectangles  with  the 
following  dimensions;  wq  =  k{B  +  I),  ho  =  k{B  +  1)  —  B,  wi  =  B  +  l,hi  =  s{ui), 
W2  =  B  +  l,h2  =  s{u2),  •  •  Wm  =  B  +  l,hm  =  s(um)  and  L  =  k{B  +  1).  Then 
any  fecisible  packing  of  these  rectangles  must  be  in  the  form  as  in  Figure  5.4.  The 
largest  rectangle  in  the  figure  is  the  first  rectangle  in  the  list.  Since  one  dimension  of 
the  rectangle  is  already  k{B  +  1),  other  rectangles  cannot  be  packed  above  it.  The 
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k(B+l)-B  B 
Figure  5.4.  Feasible  Packing  of  m+1  Rectangles 

larger  of  the  two  dimensions  is  5  +  1  for  all  but  the  first  rectangle,  hence  they  can 
only  be  packed  so  that  their  longer  sides  are  parallel  to  the  F-axis.  For  this  instance 
of  the  2D  packing  problem  to  have  a  solution,  rectangles  should  be  packed  so  that 
the  width  of  the  packing  does  not  exceed  k{B  +  1).  Note  that  each  level  of  packing 
of  rectangles  indexed  from  1  to  n  (for  example,  rectangles  "a" ,  "b"  form  one  level  of 
packing  in  Figure  5.4)  corresponds  to  a  disjoint  set  C/,-  in  the  bin  packing  problem. 
Now,  it  is  obvious  that  there  is  a  solution  for  an  instance  of  the  bin  packing  problem 
if  and  only  if  there  is  a  solution  for  the  corresponding  instance  of  the  2D  packing 
problem.  ■ 


Since  the  decision  version  of  the  problem  is  NP-complete,  the  original  optimization 
problem  is  NP-hard.  For  this  reason,  we  seek  an  approximation  algorithm  to  solve 
the  problem. 
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Since  our  packing  problem  is  somewhat  different  from  the  commonly  known  2D 
bin  packing  problem,  the  heuristic  algorithms  developed  in  [5,  4,  6,  18]  are  not  readily 
useful  for  our  purpose.  One  approach  to  solving  our  problem  is  to  try  different  bin 
widths  and  for  each  width,  use  one  of  the  above  packing  heuristics  to  see  if  the  height 
of  the  packing  does  not  exceed  the  desired  height  for  that  width.  We  can  find  the 
minimum  width  that  allows  the  desired  packing,  by  using  some  form  of  interpolation 
search  on  the  interval  between  minimum  and  maximum  widths  possible.  A  better 
approach,  that  avoids  repacking,  is  to  use  a  packing  heuristic  that  attempts  to  satisfy 
two  goals,  namely,  packing  as  tightly  as  possible  and  making  the  width/height  ratio 
for  the  packing  closely  match  the  ratio  of  the  processor  mesh.  To  this  end,  we  propose 
a  heuristic  which  we  will  call  "tight-pack"  (TP,  for  short)  heuristic. 

The  basic  idea  of  TP-heuristic  is  as  follows.  First  the  grids  are  sorted  in  some 
selected  order.  Then  we  start  packing  grids  one  by  one  at  the  south-west  corner 
of  the  bin.  Let's  consider  the  space  of  the  bin  as  the  first  quadrant  of  X-Y  plane. 
Then  the  south-west  corner  of  the  bin  becomes  the  origin  of  the  coordinate  system, 
that  is  (0,0).  Each  packed  grid  has  4  corners  NW,NE,SE  and  SW  with  respect  to 
its  orientation  in  the  packing.  A  NW  or  SE  corner  of  a  packed  grid  is  called  a  free 
corner  (FC)  if  no  other  item  occupies  that  corner  such  that  the  item  is  above  and 
to  the  right  of  that  corner.  In  our  algorithm,  only  free  comers  axe  considered  for 
packing  the  new  grid  and  when  a  new  grid  is  placed  in  a  free  corner,  it  is  placed  so 
that  it  is  above  and  to  the  right  of  the  corner.  After  packing  the  first  grid  at  the 
origin,  the  next  grid  is  packed  at  one  of  the  two  corners  created  by  packing  the  first 
grid.  Figure  5.5  (a)  shows  the  two  corners,  "a"  and  "b".  In  general,  after  packing 
Wi  X  hi  grid  at  {xj,yj),  we  add  two  more  free  corners  located  at  {xj,yj  +  hi)  and 
(xj  -|-  Wi,  yj)  to  the  list  of  free  corners.  We  also  keep  the  majcimum  size  of  the  grid 
which  can  be  packed  in  that  free  corner  along  with  its  location.  We  call  it  the  "size" 
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Figure  5.5.  The  Corners  for  Packing  the  Next  Grid 


of  free  corner.  Figure  5.5  (b)  shows  two  new  corners,  "c"  and  "d",  after  packing  the 
second  grid  at  corner  "b".  After  the  second  grid  is  packed,  the  width  of  the  grid  which 
can  be  packed  at  corner  "a"  is  limited  to  the  width  of  the  first  grid.  Hence,  we  need 
to  update  that  information  whenever  we  pack  a  grid.  Let  Wi  and  Hi  be  the  width 
and  height  of  the  space  used  in  packing  after  packing  the  first  i  grids.  We  choose  the 
corner  for  the  i  +  1-st  grid,  so  that  the  maximum  of  Wi+i  and  is  minimized.  // 
we  are  given  a  processor  mesh  P  x  Q  with  P  >  Q  and  ^  =  R,  we  choose  the  corner 
for  the  i  +  1-st  grid,  so  that  the  maximum  of  Wi+i  and  RHi+i  is  minimized.  The 
detailed  description  of  our  packing  algorithm  is  given  bellow.  Assume  that  we  are 
given  the  mesh  ratio  R  and  m  grids,  lUi  x  Ai,  1^2  x  ^2,  •  ■  •  j  'Wm  x  hm-  After  the  following 
algorithm  terminates,  global  Vciriables,  XLoci  and  YLoci,  contain  the  location  of  the 
corner  where  grid  i  was  packed.  Orienti  is  set  to  1  if  grid  i  was  rotated  and  it  is  set 
to  0  otherwise. 

Algorithm  Packing; 

1.  width  =  0 

2.  height  =  0 

3.  Initialize  the  three  lists  Corner  List,  XScanList  and  YScanList  to  be  empty. 

4.  Insert  [(0, 0),  (00, 00)]  into  Corner  List. 
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//  [(i,  y),  (p,  q)]  represents  the  corner  located  at  (z,  y)  and  the  maximum  grid 
size  that  can  be  packed  is  p  x  q.  // 

5.  for  each  u;,-  x  hi  grid  do 

begin 

6.  FindBestCorner((u;,, 

//In  procedure  FindBestCorner,  each  corner  in  Corner  List  is 
examined  and  a  corner  [{xk,yk),  {Pk,qk)]  is  chosen  which  minimizes  the 
maximum  of  width  and  height  *  R  after  iw,  x  hi  grid  is  packed  // 

7.  UpdateCornerList((u;,, 

//In  procedure  FindBestCorner,  the  size  of  each  corner  in 
CornerList  is  updated  after  Wi  x  hi  grid  is  packed  at 
corner  [{xk,yk)APk,<lk)]  II 

8.  FindNewCornerSize((t/;,-,  hi)) 

II  After  Wi  X  hi  grid  is  packed  at  corner  [(xfc,  yjt),  (pfc,        two  new 
corners,  [{xk,  yk  +  hi),  {pk',qk')]  and  [{xk  +  lu.-,  yk),  {pk",qk")]  are  created. 
In  procedure  FindNewCornerSize,  the  size  of  each  corner  is 
determined.  // 

end 

end  Packing; 

Procedure  FindBestCorner((u;,,  hi),  k); 

1.  mindim  =  oo 

2.  for  each  element  [{xj,yj),(pj,qj)]  in  CornerList  do 

begin 

3.  mxl  =  oo,    myl  =  oo,    mx2  =  oo,    my2  =  oo 

/ /  Try  to  pack  a  grid  (w,-  x  hi)  at  a  free  corner  {pj,qj)  in  both 
orientations.  // 
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4.  if  (pj  -  Wi)  >  0  and  {qj  -  hi)  >  0  then 

5.  mil  =  max{width,  Xj  +  iw.),    mj/1  =  maiK{height,yj  +  hi) 

6.  else  if  {pj  -  k)  >  0  and  {qj  -Wi)>0  then 

7.  mz2  =  max(iyiJ</i,  Xj  +  hi),    my2  =  m&x{height,  y,  +  Wi) 

8.  else  goto  the  next  iteration 

9.  ml  =  ma.x{mxl,R*  myl) 

10.  m2  =  max(mx2,  R  *  mj/2) 

11.  m  =  min(ml,m2) 

12.  if  (m  <  mindim)  then 

begin 

13.  mindim  =  m,    k  =       XLoci  =  Zj,    FLoc,  =  yj 

14.  if  (ml  <  m2)  then 

15.  mdf/i  =  mxl,    height  —  myl,    Orienti  =  0 

else 

16.  width  =  mx2,    height  =  my2,    Orienti  =  1 
end 


end 

end  FindBestCorner; 

Procedure  UpdateCornerList((u;,,  hi),  k); 

1.  Remove  [(xfc,  yk),  {pk,  qk)]  from  Corner  List 

2.  if  Orienti  =  1  then 

3.  i«i  =  /i,,    h'i  =  Wi 
else 

4.  w'i  =  Wi,    h'i  =  hi 

//If  the  grid  {wi  x  hi)  packed  at  k-th  free  corner  overlaps  with  the  packing 
area  of  a  free  corner  [(xj,  yj),  {pj,qj)],  update  the  size  of  that  free  corner.  / / 
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5.     for  each  element  [{xj,  y,),      <Zi)]  in  Corner  List  do 
begin 


6.  if  {Vk  <  Vi  <yk  +  h'i)  and  {xj  <  ifc  <  xj  +  pj)  then 

7.  Pi  =  Xk-  Xj 

8.  if  (ifc  <  XJ  <  Xfe  -f  w'i)  and  (y^  <      <  y,  +  qj)  then 

9.  qj  =  yk-  yj 

10.  if  Pj  =  0  or  9j  =  0  then 

11.  remove  [{xj,  yj),  {pj,  qj)]  from  CornerList 


end 

end  UpdateCornerList; 

Procedure  FindNewCornerSize((u;,,  hi)); 

II  Find  the  sizes  of  two  new  free  corners  created  after  packing  a  grid  {wi,  hi)  1 1 


1.  p  =  oo,    q  =  oo 

2.  if  Orienti  =  1  then 

3.  w'i  =  hi,    h'i  =  Wi 
else 

4.  w'i  —  Wi,    h'i  —  hi 


II  Find  the  closest  grid  to  the  new  corner  packed  on  line  y  =  YLoci.  1 1 

5.  for  each  element  [(x],  j/]),  (xj,  y])]  in  XScanList  do 

//  Each  element  [{x],yj),{x'j,yj)]  in  XScanList  contains  the  locations 
of  the  SW,  NW  corners  of  a  packed  item.  (Note  that  x]  =  xj)  1 1 
begin 

6.  if  [y]  <  YLoci  <  yj)  and  (XLoci  +  w'^  <  x])  then 

7.  if  (x)  <  p)  then 

8.  p  =  x] 
end 
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//  Find  the  closest  grid  to  the  new  corner  packed  on  line  x  =  XLoci  -\- w\.  1 1 

9.  for  each  element  [(x],  y]),  (i^,  y])]  in  YScanList  do 

//  Each  element  [{x],y]),{x},y])]  in  YScanList  contains  the  locations 
of  the  SW,  SE  corners  of  a  packed  item.  (Note  that  y]  =  y])  // 
begin 

10.  if  (x)  <  XLoci  +  w'i<  x])  and  {YLoci  <  y])  then 

11.  if  (y)  <  q)  then 

12.  q  =  y] 
end 

13.  p  =  p-  {XLoa  +  w'i) 

14.  q  =  q  —  YLoci 

15.  Insert  [{XLoci  +  w\,  YLoCi),  (p,  q)]  into  Corner  List 

16.  p  =  oo,    q  =  oo 

II  Find  the  closest  grid  to  the  new  corner  packed  on  line  y  =  YLoCi  +  h'^.  1 1 

17.  for  each  element  [(x],  j/]),  (x|,       in  XScanList  do 

begin 

18.  if  [y]  <  YLoci  +  h\  <  y])  and  {XLoa  <  x])  then 

19.  if  (x}  <  p)  then 

20.  p  =  x] 
end 

//  Find  the  closest  grid  to  the  new  corner  packed  on  line  x  =  XLoci.  II 

21.  for  each  element  [(x],  j/]),  (x^,       in  YScanList  do 

begin 

22.  if  (x}  <  XLoci  <  x|)  and  {YLoci  +  /tj  <  y])  then 

23.  if  (y]  <  g)  then 

24.  q  =  y] 
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end 

25.  p  =  p  —  XLoci 

26.  q  =  q-  {YLoci  +  h\) 

27.  Insert  [{XLoa,  YLoa  +  h'^),  (p,  q)]  into  Corner  List 

28.  Insert  [(JVTIoc,,  FIoc,),  (XZoci,  FLoci  +  h'^)]  into  XScanList 

29.  Insert  [(XLoc,,  FLoc,),  (XIoc,  +      YLoci)]  into  F^canXis* 
end  FindNewCornerSize; 

Every  time  a  grid  is  packed,  a  corner  is  used  and  two  new  corners  are  introduced. 
Since  there  is  only  one  corner  initially,  the  number  of  elements  in  C ornerList  is  at 
most  m  +  1.  An  element  is  added  to  XScanList  as  well  as  to  YScanList,  each  time 
a  grid  is  packed.  Hence,  the  number  of  elements  in  XScanList  and  YScanList  can 
be  at  most  m.  As  a  result,  Procedures  FindBestCorner,  UpdateCornerList  and 
FindNewCornerSize  take  0{m)  time.  Since  there  are  m  grids  to  be  packed,  the 
total  time  complexity  of  our  packing  algorithm  is  O(m^). 

Remark  :  A  small  improvement  can  be  made  if  we  remove  corners  which  are 
too  small  to  be  useful  from  CornerList.  Let  MINHOLESIZE  =  min(mini<,<m  tui, 
mini<,<,„  hi).  In  procedure  UpdateCornerList,  if  either  pj  <  MINHOLESIZE  or 
qj  <  MINHOLESIZE,  then  remove  [{xj,yj),{pj,qj)]  from  CornerList.  In  proce- 
dure FindNewCornerSize,  if  either  p  <  MINHOLESIZE  or  g  <  MINHOLESIZE, 
then  do  not  insert  [{XLoci  +  w'i,YLoci),{p,q)]  or  [{XLoci,YLoci  +  h'i),{p,q)]  into 
CornerList. 

The  following  theorem  shows  the  correctness  of  our  pcicking  cdgorithm.  In  the 
following  theorem,  we  assume  for  convenience  that  we  do  not  make  modifications  cis 
in  above  remark. 

Theorem  5.3.2  After  i-th  grid  is  packed,  The  following  two  claims  hold. 

(l)For  any  element  [{xj,yj),  {pj,  qj)]  in  the  CornerList,  the  maximum  size  of  the  grid 
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that  can  be  packed  at  corner  (xj,i/j)  is  pj  x  qj. 

(2) For  any  point  (x,y)  in  the  space  of  the  bin,  either  it  is  occupied  by  a  grid  or  it 
belongs  to  at  least  one  comer  in  the  Corner  List. 

Proof  :  To  prove  the  first  part  of  the  theorem,  we  need  to  show  that  procedure 
FindNewCornerSize  correctly  computes  the  sizes  of  the  two  new  corners.  We  will 
use  induction  on  the  number  of  grids  already  packed.  Base  case  is  trivial.  When 
j  is  1,  both  XScanList  and  YScanList  are  empty.  Hence,  the  loops  of  lines  5  - 
12  and  17  -  24  will  not  be  executed.  Both  p  and  q  remain  oo,  which  is  the  correct 
width  and  height  of  the  maximum  grid  that  can  be  packed  at  the  new  corners. 
Now,  assume  that  the  sizes  of  all  the  corners  had  been  correctly  computed  until  grid 
-  1  was  packed.  We  pack  grid  j  at  {XLoCj,YLocj),  and  two  new  corners,  namely, 
[{XLoCj  +  w'j,YLoCj),{p',q')]  and  [{X Locj ,Y LoCj  +  h'j),{p" ,q")]  are  created;  here 
w'j,  h'j  are  dimensions  of  grid  j  along  X  and  Y  directions  respectively.  Let  us  see  how 
p'  and  q'  are  determined  in  procedure  FindNewCornerSize.  XscanList  contciins 
the  coordinates  of  the  SW,  NW  corners  of  each  grid  which  was  already  packed. 
We  scan  each  element  in  XScanList  and  find  [{xl,yl),{xl,yl)]  such  that  (yl  < 
YLocj  <  yl)  and  {XLocj  +  w'j  <  xl),  and  xj^  is  minimum  among  all  such  elements; 
if  no  such  grid  exists,  then  p'  is  correctly  set  to  oo.  We  find  [(a;^,  yj/),  y^,)]  in 
YScanList  in  the  similar  way.  Figure  5.6  (a)  shows  such  grids  k,  k'.  In  procedure 
FindNewCornerSize  p'  and  q'  are  set  to  xl  -  {XLoCj  +  w'j)  and  yl,  -  YLocj 
respectively.  It  is  obvious  that  p'  and  q'  cannot  be  greater  than  the  above  values  from 
the  figure.  For  both  p'  and  q'  to  be  correct,  the  shaded  area  in  Figure  5.6  (a)  should 
not  be  occupied  by  any  grids  which  were  already  packed.  Suppose  that  there  is  a  grid 
/  packed  at  {XLoci,YLoci),  and  YLoci  <  YLock>  and  XLoci  <  XLock>,  that  is,  grid 
/  is  in  the  shaded  area.  Figure  5.6  (b)  shows  such  a  grid  /.  Since  every  grid  including 
grid  /  was  packed  at  a  valid  corner,  there  must  be  a  grid  n  either  to  the  left  of  or  below 
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grid  /.  Assume  that  grid  n  is  located  below  grid  /.  Now,  we  apply  the  same  argument 
to  grid  n,  and  so  on.  Eventually,  there  must  be  a  grid  t  packed  at  {XLoct,YLoCi) 
such  that  XLoct  <  XLock  and  {YLoct  <  YLocj  <  YLoa  +  h[).  Figure  5.6  (c)  shows 
such  a  grid  t.  This  contradicts  the  fact  that  is  the  minimum  among  those  grids. 
Therefore,  there  cannot  exist  such  a  grid  /  and  procedure  FindNewCornerSize 
correctly  determines  p'  and  q'.  Similarly,  we  can  prove  that  p"  and  q"  are  correctly 
determined. 


k" 


LI 


k' 


(a)  (b)  (0 

Figure  5.6.  Figures  for  The  Proof  of  Theorem  5.3.2 


Proving  the  second  part  of  the  theorem  is  straightforward.  We  showed  that  sizes 
of  corners  are  correctly  computed.  Since  grids  do  not  overlap  in  our  packing,  any 
point  (x,  y)  is  occupied  at  most  one  grid.  We  can  use  an  induction  similar  to  the  one 
used  for  the  first  part  of  the  theorem  to  prove  that  (x,  y)  belongs  to  the  free  space 
of  at  least  one  corner  if  it  is  not  occupied.  Base  case  (i.e.  that  is,  when  no  grids 
have  been  packed)  is  trivial.  Let's  assume  that  the  claim  is  true  after  grid  j  —  1  had 
been  packed.  Now,  we  pack  grid  j  at  corner  [{XLoCj,YLocj),{p,q)].  This  corner  is 
removed  from  Corner  List  and  two  new  corners,  [{XLocj  +  w'j,YLocj),{p' ,q')]  and 
[{XLocj,  YLocj  +  hj),  (p",  q")]  are  added,  where  w'j,  hj  are  dimensions  of  grid  j  along 
X  and  Y  directions  respectively.  If  p'  and  q'  are  correctly  computed,  p'  and  q'  must 
be  at  least  p  —  w'j  and  q  respectively.  From  the  same  reasoning,  p"  and  q"  must  be  at 
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least  p  and  g  -  h'j  respectively.  Therefore,  for  any  point  (x,y)  which  belonged  to  the 
free  space  of  the  corner  [{XLoCj,YLoCj),  (p,  q)]  before,  either  it  is  occupied  by  grid  j 
or  it  belongs  to  one  of  the  two  new  corners.  ■ 

To  show  a  bound  for  the  accuracy  of  the  solutions  provided  by  our  packing  al- 
gorithm, we  impose  the  following  restrictions  on  packing  the  grids.  When  Wi  x  hi 
{wi  >  hi)  grid  is  packed  at  a  corner  (x,r/),  it  should  be  placed  so  that  the  side  with 
dimension  to,  (long  side)  is  parallel  to  the  X-axis  if  x  <  Ry  and  it  should  be  placed 
with  long  side  parallel  to  the  F-axis  ii  x  >  Ry.  Suppose  there  is  a  free  corner  that 
cannot  accommodate  the  item  in  the  allowed  orientation  but  can  accommodate  the 
item  in  the  other  orientation.  Then  we  disregard  this  corner  though  it  is  possible  to 
get  a  better  packing  by  placing  the  item  in  that  corner.  If  x  =  Ry,  then  both  orien- 
tations of  the  grid  are  allowed  for  that  corner.  We  will  call  this  packing  algorithm 
modified  TP-heuristic.  Now  we  state  a  result  on  the  solution  accuracy  bound  when 
i2  =  1  (i.e.  for  square  meshes) 

Theorem  5.3.3  Consider  the  two-dimensional  packing  problem  with  R  =  1  and  let 
w*  =  max,  10, .  Let  Wopt,  Hopt  be  the  width  and  height  of  the  optimal  packing  and 
let  Wtp,  Htp  be  the  corresponding  values  for  the  packing  given  by  the  modified  TP- 
heuristic  when  items  are  packed  in  decreasing  order  of  their  maximum  side  lengths. 
Assume  without  loss  of  generality  that  Wopt  >  Hopt  ('■nd  Wtp  >  Htp-  Then  Wtp  < 
V2Wopt  +  Sw\ 

Proof :  Let  us  denote  the  regions  below  and  above  the  line  OP  (that  has  unit  slope) 
by  Ri  and  R2.  First  we  observe  that  there  is  always  a  corner  in  Ri  as  well  as  in  R2 
that  can  accommodate  a  subsequent  item  in  the  allowed  orientation  (i.e.  long  side 
parallel  to  F-axis  in  Ri  and  par<illel  to  X-axis  in  R2).  This  follows  from  the  fact 
that  we  can  always  pack  a  subsequent  item  q  abutting  to  V-axis  (or  X-axis).  Since 
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the  item  q'  below  (or  to  the  left  of)  q  was  packed  prior  to  q,  the  maximum  side  of  q' 
is  longer  than  the  maximum  side  of  q.  Hence,  there  is  always  enough  space  to  pack 
item  q  above  (or  to  the  right  of)  item  q'. 

Now,  we  show  that  Wtp  -  Htp  <  w*  as  follows.  Let  Wfp,  Wjp  be  the  width  and 
height  of  the  packing  after  the  i-th  item  is  packed.  Suppose  that  Wtp  -  Htp  >  W. 
Then  there  must  be  an  item  i  of  dimensions  (iw,-,  hi)  such  that  that  Wt~p  -  H^fp  ^  ^' 
and  Wfp  -  Hipp  >  w*.  It  also  must  be  true  that  the  i-th  item  was  packed  at  a 
corner  in  Ru  hence  W'fp  <  W^p  and  Wf^  <  Hipp.  Let  (x,y)  be  the  location  of 
a  corner  in  R2  which  can  accommodate  the  i-th  item.  Then  x  <  y  <  H*tp-  Since 
x  +  Wi<  Hifp  +  <  Wi^p  and  y -\- hi  <  H'fp  ■\- w*  <  Wfp,  the  maximum  of  the 
width  and  height  of  the  packing  would  be  smaller  if  the  item  i  were  packed  at  the 
comer  at  (x,y).  Hence  the  item  i  should  not  have  been  packed  at  a  corner  in  R\. 
Since  we  always  pack  items  so  that  the  maximum  of  the  width  and  height  of  the 
packing  is  minimized,  we  can  conclude  that  Wtp  —  Htp  <  w*. 

We  only  have  to  prove  our  result  when  Wtp  >  3iw*  in  which  case  Htp  >  2txj*. 
Consider  the  two  isosceles  right  triangles  Ai  and  A2  (see  Figure  5.7)  in  regions  Ri  and 
R2  with  areas  {^rp-'^'"')\  {Htp-^^'?  respectively.  Note  that  all  items  in  the  region 
Ai  {A2)  have  been  packed  with  their  long  sides  parallel  to  Y  {X)  axis.  Thus  the  items 
are  packed  in  these  regions  as  in  bottom-up  left-justified  (BL  for  short)  strategy  of 
[5].  We  can  make  the  similar  ajgument  on  the  occupancy  of  A\  and  A2  as  in  [5]. 
Note  that  any  vertical  (or  horizontal)  cut  through  Ai  (or  ^2)  can  be  partitioned  into 
alternating  segments  corresponding  to  cuts  through  unoccupied  and  occupied  areas. 
Using  the  fact  that  there  are  items  to  the  right  of  Ai  and  above  A2  and  considering 
the  order  in  which  the  items  are  packed,  we  can  show  that  the  sum  of  the  occupied 
segments  is  at  leaist  the  sum  of  the  unoccupied  segments.  By  integrating  the  lines 
over  Wtp  —  2w*  (or  Htp  —  2u;*),  we  can  verify  that  A\  (or  A2)  is  at  least  half  full. 
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that  W^,,  >  1/4  {{Htp  -  2w*f  +  {Wtp  -  Iw^f)  >  iMxL^  and  the 


This  means  mat  rr^p^ 
result  follows  from  the  fact  that  Wtp  —  Hjp  <  w*. 


-2  w* 


w*  -2  w*  W 

Figure  5.7.  Regions  Ai  and  A2  in  the  Packing  (Theorem  5.3.3) 

The  bound  can  be  improved  when  the  items  to  be  packed  are  square  shaped. 

Theorem  5.3.4  Consider  the  same  2D  packing  problem  as  in  Theorem  5.3.3  except 
that  the  items  are  square  shaped.  If  the  items  are  packed  in  decreasing  order  of  their 
sizes  in  the  modified  TP-heuristic,  then  Wtp  <  V^Wopt  +  2w*. 

Proof  :  The  proof  of  this  theorem  is  similar  to  the  proof  of  the  previous  theorem.  It 
can  be  eeisily  shown  that  Wtp  —  Htp  <  w*.  Since  all  the  items  are  square  shaped, 
they  are  packed  as  in  BL  strategy  [5]  in  the  two  isosceles  right  triangles  and  S2  (see 
Figure  5.8)  with  areas  V^tf-^  )  ^{Hxp-w  )  respectively.  Using  the  fact  that  there  are 
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items  to  the  right  of  and  above  52  and  considering  the  order  in  which  the  items 
are  packed,  we  can  show  that  Si  and  S2  are  at  least  half-occupied.  This  means  that 
^op<  ^  1/^{{Htp  -  w'Y  +  {Wtp  -  w'f)  >  ^"tf-^'?  and  the  result  follows  from 
the  fact  that  Wjp  -  Hjp  <w*.  ■ 


H 


TP 


-w* 


Figure  5.8.  Regions      and  S2  in  the  Packing  (Theorem  5.3.4) 


The  order  in  which  we  consider  the  grids  for  packing  will  affect  the  performance 
of  our  packing  algorithm.  Since  small  grids  have  a  better  chance  to  fit  into  holes 
without  increasing  the  width  or  height  of  the  packing,  it  is  intuitively  advantageous 
to  pack  them  later.  Assuming  that  wi  >  (1  <  i  <  m),  we  consider  the  following 
four  orderings  of  grids  to  be  reasonable  for  good  performance. 

1.  Decreasing  order  of  iw,- 
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2.  Decreasing  order  of  hi 

3.  Decreasing  order  of  Wi  *  hi 

4.  Decreeising  order  of  ^ 

Since  sorting  the  grids  can  be  done  in  O(mlogm)  time,  the  time  complexity  of  our 
packing  algorithm  will  be  same  regardless  of  which  grid  ordering  is  used. 

5.3.2    A  Parallel  Packing  Algorithm 

The  packing  algorithm  described  in  the  previous  section  is  sequential.  One  pro- 
cessor has  to  collect  all  the  information  about  the  grids  from  the  other  processors  and 
execute  the  packing  algorithm  in  order  to  find  the  processor  allocation.  The  result 
of  allocation  should  be  communicated  to  all  the  processors  which  in  turn  communi- 
cate the  boundary  values  for  the  fine  grids  to  the  corresponding  processors.  In  this 
section,  we  present  a  parallel  algorithm  in  which  m  processors  cooperatively  execute 
the  packing  algorithm  for  m  grids  in  order  to  speed  up  the  algorithm.  The  same 
packing  method  that  was  used  in  the  sequential  algorithm  will  be  used  in  the  parallel 
algorithm  described  here. 

In  the  adaptive  mesh  algorithm,  each  rectangular  grid  at  level  i  may  be  generated 
across  more  than  one  processor.  One  of  those  processors  (with  lowest  index)  produces 
a  packet  <  (wj,hj),Sj  >  for  grid  j,  where  {wj,hj)  is  the  dimension  of  grid  j  and 
Sj  is  its  own  processor  index.  These  packets  contain  the  necessciry  information  that 
forms  the  input  data  to  our  packing  and  processor  allocation  algorithms,  processors 
for  them.  The  detailed  description  of  the  algorithm  is  given  below.  The  topology 
of  multiprocessors  on  which  the  algorithm  runs,  is  assumed  to  be  either  mesh  or 
hypercube. 
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Algorithm  Parallel  Packing 

(a)  Let  PS  be  a  set  of  m  processors  forming  a  submesh  or  a  subcube,  that  is 
PS  =  {pbi\0  <  i  <  m  -  1}.  The  processors  which  produced  packets  send 
them  to  the  processors  in  PS  (one  packet  per  processor)  using  procedure 
TokenPacking.  The  processors  in  PS  will  do  the  remaining  steps  of  our 
paraillel  packing  algorithm. 

(b)  Sort  packets  according  to  the  packing  order  using  a  parallel  sorting  algorithm. 
After  sorting,  assume  that  processor  Pj,  is  holding  packet  <  {wi,hi),Si  >. 

(c)  Store  the  initial  free  corner  [(0, 0),  (oo,  oo)]  at  processor  Pb^.  Each  processor  Pt- 

set  both  width  and  height  to  0.  Now,  pack  the  grids  one  by  one  by  performing 
m  iterations  where  in  the  i-th  iteration  (0  <  i  <  m  -  1)  call  procedure 
GridPacking(i). 

(d)  After  step  (c),  each  processor  Pb^  has  {XLoci,YLoci),  that  is,  the  free  corner 
where  grid  (iw,,  A,)  was  packed  and  the  width  and  height  of  the  packing.  Each 
processor  Pbi  include  the  above  information  in  its  packet  <  (wi,hi),Si  >  and 
send  it  to  Si  using  procedure  SendPacketToSource. 

end  Parallel  Packing; 
Procedure  TokenPacking; 

/ /  Here  a  subset  of  processors  Pjg ,  Pjj , . . . ,  Pj,  with  jo  <  ji  <  •  •  <  ji  has  one  packet 
each  and  it  is  desired  to  store  the  packet  of  Pj^  in  Pk  for  t  <  k  <  {t  -{■  I)  mod  m  for 
some  0  <  t  <  N  —  I.  Using  greedy  routing,  this  operation  can  be  done  in  0{\ogN) 
time  on  a  hypercube  and  0{y/N)  time  on  a  mesh  [26].  // 

Procedure  GridPacking(i); 

1.  Call  procedure  Broadcast(6,-,  <  {wi,hi),Si  >). 
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2.  Call  parallel  procedure  FindBestCorner((u;,,  hi),  k)  to  find  the  corner  {xk,  yk) 
where  the  grid  (iw,, /ij)  is  to  be  packed,  and  its  orientation. 

3.  Processor  P;,.  set  XLoci  and  YLoci  to  Xk  and  yk  respectively,  and  set  OrienU  to 
Orient k.  Also  determines  the  width  and  height  of  the  packing  after  {wi,  hi) 
is  packed. 

Call  procedure  Broadcast(6„  ((xjt,  yk),  {width,  height),  Orienti)). 

4.  Call  parallel  procedure  UpdateCorners((u;,-,  hi),  k)  to  update  the  sizes  of 
corners  after  the  grid  {wi,hi)  is  packed  at  {xk,yk)- 

5.  Call  Parallel  procedure  FindNewCornerSize((u;„ /i,).  A;)  to  determine  the 
sizes  of  two  new  corners. 

Procedure  FindBestCorner((tw,,  hi),  k); 
1.    Each  Processor  Pb^  does  the  following 
begin 

Let  [ixj,yj),{pj,qj)]  the  corner  Pbj  is  holding 
if  {pj  —  Wi)  >  0  and  {qj  —  hi)  >  0  then 

mil  =  max{width,Xj  +  Wi),    myl  =  ma,x{height,yj  +  hi) 
else  if  {pj  —  hi)  >  0  and  (qj  —  iw,)  >  0  then 

mx2  =  max{width,Xj  +  hi),    my2  =  ma.x{height,yj  +  Wi) 
else  set  m  to  oo  and  goto  the  next  step 
ml  =  max(mzl,  R  *  myl) 
m2  =  mcix(mi2,  R  *  mj/2) 
rrij  =  min(ml,m2) 
if  (ml  <  m2)  then    Orient j  =  0 
else    Orient  j  =  1 
end 


If  Pbj  has  two  corners  then  choose  the  one  which  gives  smaller  nij. 

If  Pbj  does  not  have  any  corners  then  set  rrij  to  oo. 
2.    Call  parallel  procedure  FindMin({m,}I^o^ ,  m^,  6.) 
end  FindBestCorner; 

Procedure  UpdateCorners((ii;,,  hi),  k); 

1.  Processor  Pb^  remove  [{xk,  j/t),  (pjt, 

2.  Each  Processor  Pb^  does  the  following 

begin 

Let  [{xj,yj),{pj,qj)]  the  corner  Pb^  is  holding 
if  Orienti  =  1  then 

w'i  =  ^j,    h'i  =  Wi 
else 

u;-  =  Wi,    h'i  =  hi 
if  iVk  <  Vj  <yk  +  h'i)  and  {xj  <  Xk  <  Xj  +  pj)  then 
Pj  =  a;*;  —  Xj 

if  (xjt  <  Xj  <  xjt  +  u;-)  and  (j/j  <  yjt  <  Vj  +  9j)  then 

9i  =  yk-  Vj 

if  pj  =  0  or  9j  =  0  then 
remove  [(xj, yj),  (pj, gj)] 

end 

If  Pb-  has  two  corners  then  update  the  second  one  in  the  same  way. 
end  UpdateCorners; 

Procedure  FindNewCornerSize((ii;,-,  hi),  k); 
1.    Each  Processor  Pbj{j  <  i)  does  the  following 
begin 

if  Orienti  =  1  then 
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w'i  =  hi,    h'i  =  Wi 
else 

w'i  =  Wi,    h'i  =  hi 
if  {YLocj  <yk<  YLocj  +  hj)  and  {xk  +  w'i<  XLocj) 

then  pj  =  XLocj 
else  Pj  =  oo 

if  (XLocj  <  Xfc  +  w\  <  XLocj  +  luj)  and  (j/t  <  rioCj) 

then  qj  =  YLocj 
else     =  oo 

2.  Call  parallel  procedure  FindMin({p/})Io,p,  k) 

3.  Call  parallel  procedure  FindMin({g/}j^J,  q,  hi) 

4.  Find  p'  and     in  the  same  way  for  the  other  corner. 

5.  Store  the  two  new  corners,  [{XLoCi  +  w'i,  YLoci),  {p,  q)]  and 
[{XLoci,YLoci  +  h'i),{p',q%  at  Processor  Pj,. 

end  FindNewCornerSize; 

Procedure  Broadcast(j,  V); 

//  Processor  Pj  broadcasts  value  V  to  all  the  processors  in  PS;  takes  0(log  A'^)  time 
in  a  hypercube  and  0{y/N)  time  in  a  mesh  [26].  // 

Procedure  FindMin({a,}"r(/, 6,;); 

//  Given  the  a/  values  with  one  value  per  processor,  find  the  minimum  (6)  of  these 
values  eind  store  it  in  the  processor  Pj.  This  operation  can  be  done  in  O(log  N)  time 
on  a  hypercube  and  0{\/N)  time  on  a  mesh.  / / 

Procedure  SendPacketToSource 

1.  Each  processor  Pi,-  include  XLoci,  YLoci,  Orienti,  width  and  height  in  the 
packet  <  {wi,hi),Si  >. 
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2.  For  a  hypercube,  sort  the  packets  according  to  5",-  using  a  parallel  sorting  algo- 
rithm. 

3.  Send  each  packet  <  {wi,  hi),  Si,  {XLoci,YLoci),Orienti,  (width,  height)  >  to 
processor  Si.  For  this,  use  the  monotone  routing  (see  chapter  3)  for  a  hypercube 
and  one-to-one  routing  [26]  for  a  mesh. 

end  SendPacketToSource 

Finding  a  corner  to  pack  the  next  grid,  updating  the  sizes  of  corners  and  deter- 
mining the  sizes  of  two  new  corners  are  done  in  the  same  way  as  in  the  sequential 
packing  algorithm.  After  executing  the  above  algorithm,  processor  Si  can  find  a  sub- 
mesh  for  grid  {wi,hi)  using  the  information  included  in  the  packet. 
Time  complexity  analysis  for  a  hypercube  : 

Step  (a)  takes  O(log  N)  time  and  step  (b)  takes  0(log^  m)  time.  Both  procedure 
FindBestCorner  and  FindNewCornerSize  take  O(log  m)  time,  procedure  Up- 
dateCorners  takes  a  constant  time.  Hence  step  (c)  takes  O(mlogm)  time.  Step 
(d)  takes  O(log  N  +  log^  m)  time.  The  total  time  complexity  for  a  hypercube  is 
0(log  N  +  m\og  m). 

Time  complexity  analysis  for  a  hypercube  : 

Step  (a)  takes  0{y/N)  time  and  step  (b)  takes  0{y/rn\ogm)  time.  Both  procedure 
FindBestCorner  and  FindNewCornerSize  take  0{y/m)  time,  procedure  Up- 
dateCorners  takes  a  constant  time.  Hence  step  (c)  takes  0{m-^m)  time.  Step  (d) 
takes  0{VN)  time.  The  total  time  complexity  for  a  mesh  is  0{y/N  +  my/m). 

5.3.3    Allocation  of  Processors 

In  the  previous  sections,  we  described  algorithms  for  packing  grids  in  such  a  way 
that  the  ratio  of  width  to  height  is  as  close  to  the  mesh  ratio  as  possible  and  the 
packing  area  is  minimized  at  the  same  time.  Now,  we  are  ready  to  allocate  a  set  of 


Ill 


processors  (submesh)  for  each  grid  using  the  results  of  packing.  Let  W  and  H  be 
the  width  and  height  of  the  packing,  and  let  (x,-,  yi)  he  the  location  of  the  south-west 
corner  of  grid  i  with  dimensions  tw,  x  hi.  Assume  that  we  are  given  a  P  x  Q{P  >  Q) 
mesh  of  multiprocessors  and  let  W>H.  Let  (x  x  j/))  be  a  x  x  y  submesh  with 

its  south-west  corner  processor  indexed  We  propose  the  following  two  ways  to 

allocate  submeshes  to  grids. 

1.  Assign  submesh 

((b.#J,  Ly.iJ),  (U^.-  +       -  b.-^J)  X  (L(2/.-  +      -  lyS\))  t°  g"^ 

2.  If  (p  <        then  assign  submesh 

((L^.iJ,  bSJ),  (L(^.-  +  «^.)iJ  -  L^.iJ)  X  ilivi  +       -  IvSl))  to  g"d  i. 

else  assign  submesh 

((b.-^J,  Ly.-^J)>  (L(^.-  +  ^i)w\  -  b.-^J)  X  iliy^  +  '^•O^J  -  Lj/.#J))  to  grid  i. 

The  first  method  tries  to  use  as  many  processors  as  possible  to  reduce  the  compu- 
tational workload,  but  may  violate  the  property  that  =  ^  for  all  the  grids.  In  the 
second  method,  we  allocate  processors  in  such  a  way  that  ^  «  jj--  In  other  words, 
the  second  method  uses  uniform  scaling  in  X  and  Y  directions  while  the  first  method 
,  uses  different  scaling  factors  in  X  and  Y  directions  in  order  to  reduce  the  packing  to 
the  given  mesh  size.  In  the  second  method,  we  may  not  utilize  all  the  processors  in 
the  mesh.  We  will  call  the  first  and  second  method  non-uniform  scaling  and  uniform 
scaling  respectively.  Both  of  the  methods  are  identical  to  each  other  when  p  =  ^• 

To  make  sure  that  both  X,-  and  Yi  are  not  zero,  that  is,  at  least  one  processor 
is  allocated  to  each  grid,  we  impose  the  following  lower  bound  on  the  size  of  each 
grid.  Let  D  =  X^,- max  (it;,-,  A,).  Z)  is  a  very  loose  upper  bound  on  ma,x{W,H)  where 
W  tmd  H  are  width  and  height  of  any  packing.  Then,  min,  (min(u;,, /^,))  > 
should  be  satisfied.  To  achieve  better  bound  on  D,  we  can  quickly  pack  the  grids 
in  the  following  way.  Each  time  a  grid  is  packed,  we  examine  only  two  corners,  one 
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on  X-axis  and  the  other  on  Y-axis.  Then  use  maLx{width,  height)  as  D.  Since  we 
always  consider  only  two  corners,  this  packing  takes  only  0(m)  time.  After  finding 
the  lower  bound  on  the  size  of  the  grids,  change  u;.  or  hi  to  this  value  if  they  are 
smaller  than  that. 

5.3.4    Processor  Allocation  for  Hvpercubes 

If  we  are  given  a  hypercube,  we  can  still  use  the  above  method  to  allocate  pro- 
cessors for  grids  assuming  that  we  are  given  a  mesh.  Then,  we  can  easily  embed 
the  mesh  in  the  given  hypercube  using  reflected  Gray  code  [10].  The  reflected  Gray 
code  is  generated  in  the  following  way.  We  start  with  the  1-bit  Gray  code  sequence 
{0, 1},  eind  then  insert  a  zero  and  a  one  in  front  of  the  two  elements  obtaining  the 
two  sequences  {00,01}  and  {10, 11}.  We  then  reverse  the  second  sequence  to  obtain 
{11, 10},  and  then  concatenate  the  two  sequences  to  obtain  the  2-bit  reflected  Gray 
code. 

{00,01,11,10} 

Generally,  given  a.  {d  —  l)-bit  code 

{bi,b2,. . .  ,bp}, 

where  p  =  2''"^  and  6i,  63, . . . ,  6p  are  binary  strings,  the  corresponding  <f-bit  code  is 

{061,062,...,  06p,16p,16p_i,...,16i}. 

The  preceding  recursive  construction  of  the  reflected  Gray  code  sequence  can  be 
generalized.  Let  da  and  db  be  positive  integers,  and  let  d  =  da  +  di,.  Suppose  that 
{ai,a2,...,ap.}  and  {61,  62, . . . ,  6p^}  are  the  da-hit  and  (ij-bit  reflected  Gray  code 
sequences,  where  pa  =  2''"  and  pb  =  2'^''.  Consider  the  pa  x  pi  matrix  of  d-hxi  strings 
{aihj\i  =  1,2,. ..,pa,;  =  1,2,. ..,pb}. 
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02^1       02^2      ■  ■  ■  ^i^pb 

m  »  •  • 

•  .  •  • 

\  ap.fci   ap„&2    •  •  •   o.paKb  ) 
The  preceding  construction  also  shows  that  the  nodes  of  a  d-cnhe  can  be  arranged 
along  a  two-dimensional  mesh  with  pa  and  ph  nodes  in  the  first  and  second  dimen- 
sions, respectively.  The  (i,i)-th  element  of  the  mesh,  where  i  =  1,2,  ...,Po  and 
;  =  1, 2, . . .  ,p(,,  is  the  d-cube  node  with  identity  number  a.-fej. 

In  iV-node  hypercube,  we  can  embed  log  ^/N  -\-  1  different  meshes  with  power  of 
two  nodes  in  each  dimension,  such  as  (2i°8^  x  2^-^^),  ^2'°^Vn+i  ^  2i«>8v^-i),  •  •  • , 
(2^°8^  X  2°).  We  can  pick  one  of  the  above  meshes  and  pack  the  grids  using  its  mesh 
ratio.  Or,  we  can  try  for  all  the  above  meshes  and  pick  the  one  which  gives 
closest  to  the  corresponding  mesh  ratio.  Since  we  use  only  m  processors  in  the  parallel 
packing  algorithm,  we  can  pack  the  grids  using  [^J  different  mesh  ratios  at  the  same 
time.  If  we  try  all  log  y/N  +  1  meshes  using  the  parallel  packing  algorithm,  we  have 
to  repeat  the  packing  algorithm  [-"-(H^^+i)]  times.  This  takes  C>(!2ii2&^l2£V^)  ti^^e 
since  one  execution  of  the  packing  algorithm  takes  0{m  log  m)  time. 

5.3.5    Experimental  Results 

We  performed  experiments  to  measure  the  performance  of  our  approximation 
algorithm  based  on  the  TP-heuristic.  As  in  the  experiments  for  the  first  approach, 
we  start  with  some  number  of  (cocirse)  grids  and  assume  that  at  each  level  of  adaptive 
mesh  refinement,  the  number  of  fine  grids  (child  grids)  created  within  each  coarse 
grid  region  is  exponentially  distributed  with  a  mean  value  of  1.  The  number  of  mesh 
points  in  each  fine  grid  is  uniformly  distributed  in  the  range  [MINSIZE,MAXSIZE]. 
The  aspect  ratios  of  fine  grids,  namely  ^  {wi  >  hi),  are  also  cissumed  to  be  uniformly 
distributed  in  the  range  [1,MAXRATI0].  In  our  experiments,  we  went  through  200 
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levels  of  refinement.  For  assignment  of  fine  grids  at  each  level,  we  first  solve  the 
two-dimensional  packing  problem  and  then  allocate  a  submesh  for  each  grid  using 
the  methods  described  in  the  previous  section. 

We  compared  the  performance  of  our  algorithm  with  that  of  an  algorithm  based  on 
the  modified  first-fit  level  heuristic  (LP)  discussed  in  [18]  for  packing  rectangles  into 
a  bin  of  fixed  width.  In  this  heuristic,  rectangles  (grids)  are  considered  in  decreasing 
order  of  height  and  packed  level  by  level  as  in  the  first-fit  heuristic.  But  successive 
levels  are  packed  left-right  and  right-left  in  the  bin  of  fixed  width  so  that  tallest  item 
in  a  level  will  be  above  the  shortest  in  the  previous  level.  This  will  allow  the  items  to 
drop  down  after  packing  which  may  in  turn  reduce  the  total  height  of  the  packing. 
Since  in  our  problem,  the  width  is  not  fixed,  we  have  to  apply  an  iterative  process 
where  in  each  iteration,  we  use  the  LP-heuristic  to  determine  the  height  of  the  packing 
for  some  fixed  width.  The  initial  width  is  set  equal  to  \/RS  where  S  =  Ei(«^«  *  ^f) 
and  R  is  the  mesh  ratio.  Then,  the  width  is  increased  by  a  small  amount  (1  %  of 
the  width)  each  time  until  we  get  a  ratio  for  ^  that  is  just  above  the  processor  mesh 
ratio  ^. 

The  objective  function  values  were  computed  using  the  following  formula. 

where  Ucomp,  Ucomm  and  F  are  assumed  to  be  1  for  simplicity.  First,  we  compare  the 
solution  values  obtained  by  the  non-uniform  and  uniform  scaling  processor  allocation 
methods  in  Figure  5.9.  Here,  we  started  with  40  grids.  MINSIZE  and  MAXSIZE  are 
set  to  it  ♦  0.7  and  k  *  1.3  where  the  constant  k  is  set  based  on  the  average  number 
of  mesh  points  per  processor  which  is  assumed  to  be  300.  32  x  32  processor  mesh 
was  used  for  assignment.  The  experiment  was  repeated  for  different  cispect  ratios 
of  grids.  Figure  5.9  shows  that  the  non-uniform  scaling  method  performs  slightly 
better  than  the  uniform  scaling  method.  The  reason  for  the  narrow  difference  in  the 
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performances  of  the  two  methods  is  attributed  to  the  fact  that  our  algorithm  gives  a 
packing  whose  ^  ratio  is  close  to  J  ratio  of  the  mesh.  In  our  remaining  experiments, 
we  use  only  the  non-uniform  scaling  method  for  allocation  of  submeshes.  Figure  5.10 
shows  the  solution  values  obtained  after  applying  four  different  packing  orders  of 
grids  in  TP-heuristic.  From  the  results,  we  see  that  packing  grids  in  decreasing  order 
of  grid  size  (wi  *  hi)  performs  the  best,  while  packing  them  in  decreasing  order  of 
their  aspect  ratio  (^)  performs  the  worst.  We  use  only  decreasing  order  of  grid  size 
as  a  packing  order  in  our  remaining  experiments. 

Figure  5.11  -  Figure  5.26  show  comparison  of  TP  and  LP  heuristics  with  different 
values  of  various  parameters.  In  the  figures,  we  show  the  cumulative  sum  (across 
200  levels)  for  (a)  computational  cost,  (b)  communication  cost,  (c)  sum  of  (a)  and 
(b)  and  (d)  average  processor  utilization  which  indicates  how  tightly  the  grids  are 
packed  in  the  packing  algorithm.  The  processor  utilization  is  the  ratio  of  the  number 
of  processors  used  in  allocation  to  the  total  number  of  processors.  Figure  5.11  and 
Figure  5.14  show  how  the  performance  of  each  heuristic  varies  with  variance  in  grid 
size  (i.e.  number  of  grid  points).  In  this  experiment,  MINSIZE  and  MAXSIZE  are 
set  to  A;  *  (1-VAR)  and  k  *  (1-l-VAR)  respectively,  and  VAR  varies  from  0.1  to  0.9. 
We  used  the  same  values  for  other  parameters  as  in  the  previous  two  experiments. 
Figure  5.15  and  Figure  5.18  show  how  the  performance  of  each  heuristic  varies  with 
the  aspect  ratios  of  the  grids.  Figure  5.19  cind  Figure  5.22  show  how  the  perfor- 
mance of  each  heuristic  varies  with  processor  mesh  ratio.  In  this  experiment,  we  use 
32*MESHRATIOx32  mesh  for  assignment,  where  MESHRATIO  varies  from  1  to  3. 
Figure  5.23  and  Figure  5.26  show  how  the  performance  of  each  heuristic  varies  with 
initial  number  of  grids.  Below  we  give  a  summary  of  our  observations. 

1.  As  can  be  seen  in  Figure  5.11  and  Figure  5.14,  the  TP-heuristic  performs  bet- 
ter than  the  LP-heuristic  when  the  vaxiation  in  grid  sizes  is  reasonably  large. 
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5.10.  The  Comparison  of  Four  Different  Packing  Orders 
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Figure  5.11.  Computation  Costs  for  Different  Grid  Size  Variances 
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Figure  5.12.  Communication  Costs  for  Different  Grid  Size  Variances 
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Figure  5.13.  Total  Costs  for  Different  Grid  Size  Variances 


84 
83 
82 
81 
80 
79 
78 
77 
76 


75 


0.1 


TP-Heuristic  -  ♦ 
LP-Heuristic  -»- 


03 


OS 
VAR 


0.7 


03 


Figure  5.14.  Processor  Utilizations  for  Different  Grid  Size  Variances 
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Figure  5.15.  Computation  Costs  for  Different  Aspect  Ratios  of  Grids 
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Figure  5.16.  Communication  Costs  for  Different  Aspect  Ratios  of  Grids 
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Figure  5.17.  Total  Costs  for  Different  Aspect  Ratios  of  Grids 
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Figure  5.18.  Processor  Utilizations  for  Different  Aspect  Ratios  of  Grids 
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Figure  5.19.  Computation  Costs  for  Different  Aspect  Ratios  of  Processor  Meshes 
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Figure  5.20.  Communication  Costs  for  Different  Aspect  Ratios  of  Processor  Meshes 
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Figure  5.21.  Total  Costs  for  Different  Aspect  Ratios  of  Processor  Meshes 
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Figure  5.22.  Processor  Utilizations  for  Different  Aspect  Ratios  of  Processor  Meshes 
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Figure  5.23.  Computation  Costs  for  Different  Numbers  of  Grids 
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Figure  5.24.  Communication  Costs  for  Different  Numbers  of  Grids 
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Figure  5.25.  Total  Costs  for  Different  Numbers  of  Grids 
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Figure  5.26.  Processor  Utilizations  for  Different  Numbers  of  Grids 
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The  processor  utilization  also  increases  with  the  variation  in  grid  sizes  for  TP- 
heuristic.  Since  the  processor  utilization  indicates  how  tightly  the  grids  are 
packed  in  the  packing  algorithm,  this  result  demonstrates  that  our  proposed 
heuristic  gives  good  packings  in  practice. 

2.  Figure  5.15  and  Figure  5.18  show  that  LP  outperforms  TP  only  when  the  grids 
are  square  or  close  to  being  square  shaped.  But  as  the  aspect  ratio  increases, 
TP  produces  better  solutions  than  LP  as  indicated  by  smaller  computational 
and  communication  costs  as  well  as  higher  processor  utilization.  In  both  TP 
and  LP  heuristics,  the  total  cost  increases  with  the  maximum  aspect  ratio  of 
the  grids.  This  implies  that  it  is  difficult  to  pack  the  grids  tightly  when  the 
variance  in  aspect  ratios  is  large. 

3.  The  following  reason  can  be  attributed  to  the  better  performance  of  TP  over  LP 
for  large  variation  in  grid  sizes  or  larger  aspect  ratios  :  TP  allows  smaller  grids 
that  are  packed  later  on,  to  be  packed  into  holes  created  during  the  pticking  of 
larger  grids  and  this  has  the  effect  of  keeping  the  width  and  height  of  packing 
as  small  as  possible. 

4.  Figure  5.19  and  Figure  5.22  show  that  LP  outperforms  TP  when  processor  mesh 
ratio  is  larger  than  2.  Higher  processor  utilization  for  LP  heuristic  indicates 
that  LP  can  pack  rectangles  tightly  when  the  width  of  the  bin  is  large.  In 
LP-heuristic,  the  number  of  levels  in  packing  is  small  when  the  width  of  the 
bin  is  large. 

5.  Figure  5.23  £ind  Figure  5.26  indicate  that  TP  performs  always  better  than  LP 
regardless  of  the  number  of  grids  to  be  packed.  Processor  utilization  of  TP 
seems  to  increcise  slightly  as  the  number  of  grids  increases. 
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6.  In  all  the  figures,  both  computation  and  communication  costs  show  the  same 
behavior.  The  reason  for  this  is  that  both  computation  and  communication 
costs  are  inversely  proportional  to  the  number  of  processors  allocated  and  pro- 
cessors are  uniformly  allocated  to  the  grids  according  to  the  number  of  grid 
points. 

7.  In  most  of  the  cases,  TP  outperforms  LP  in  both  total  cost  and  processor 
utilization.  One  of  the  reasons  for  this  is  that  TP-heuristic  takes  advantage  of 
the  fact  that  two  possible  orientations  are  allowed  in  the  packing.  TP  has  the 
additional  advantage  over  LP  in  that  it  avoids  repacking  and  hence  has  less 
time  complexity. 

5.4    Extension  for  Inter- grid  Communication  Cost 

The  solution  method  based  on  two-dimensional  packing  ignores  inter-grid  com- 
munication cost.  In  this  section,  we  discuss  possible  approaches  for  minimizing  inter- 
grid  communication  cost  as  well  as  computation  and  intra-grid  communication  costs. 
First  we  consider  one-dimensional  problem  and  propose  an  iterative  approach.  Then 
we  discuss  the  possibility  of  applying  it  and  other  approaches  to  solving  the  two- 
dimensional  case. 

5.4.1    One-Dimensional  Problem 

Processor  allocation  for  the  one-dimensional  problem  is  relatively  easier  than  two- 
dimensional  problem.  We  can  find  an  optimal  solution  when  there  is  only  one  grid  at 
each  level.  Let  X  be  the  number  of  contiguous  processors  (in  one-dimensional  mesh) 
allocated  for  the  newly  generated  fine  grid.  Then,  the  length  of  the  communication 
path  for  inter-grid  communication  is  X  -f-  a  in  the  worst  case,  where  a  depends  on 
the  actual  locations  of  these  X  processors.  When  there  is  only  one  fine  grid  at  each 
level,  we  can  allocate  X  processors  including  the  processors  where  the  fine  grid  was 
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generated.  Then  a  becomes  a  constant.  For  the  fine  grid  at  level  2,  in  Figure  5.2 
(a)  X  =  2  and  a  =  -2,  and  in  Figure  5.2  (h)  X  =  i  and  a  =  -2.  If  we  uniformly 
divide  the  fine  grid  among  X  processors,  the  number  of  grid  points  each  processor 
is  responsible  for  is  ^,  where  M  is  the  total  number  of  grid  points  in  the  fine  grid. 
Then,  the  problem  is  finding  X  which  minimizes 

M  M 

Ti  =  UcompY  +  Ucomm^iX  +  a)  +  C/ 

The  first  term  is  the  computation  cost  for  each  processor  to  which  ^  grid  points  are 
assigned.  The  second  term  is  the  worst  case  inter-grid  communication  cost  when  the 
data  have  to  travel  through  a  path  of  length  X  +  a.  The  third  term  C/  is  the  cost 
of  intra-grid  communication.  It  is  a  constant  because  regardless  of  X,  it  is  necessary 
to  exchange  only  two  grid  points  (left  end  and  right  end)  values  with  neighboring 
processors.  In  the  above  equation  T,  decreases  indefinitely  up  to  C/  as  A"  increases. 
But,  as  we  increase  X,  ^  reaches  packet  size  Sp  and  communication  cost  is  only 
proportional  to  A"  +  a.  Then,  the  above  equation  becomes 

M  M 
Ti  =  UcompY  +  UcommSp{X  +  a)  +  Ci,  wheu  Y  < 

If  we  differentiate  Ti, 

dTi    Ucomp^   I  TT  Q 

T  UcommJP 


dX  x^ 

^  becomes  0  when  X  is  equal  to  /^'°^p^  .  Therefore,  T,  is  minimized  when  X  is 
equal  to  max(f,yS^). 

This  analysis  is  valid  when  there  is  only  one  fine  grid  at  level  i.  The  problem 
becomes  more  difficult  when  there  are  more  than  one  fine  grid  at  level  i.  If  each 
grid  is  assigned  to  a  set  of  X  processors  that  are  in  the  neighborhood  of  the  set 
of  processors  where  it  was  generated,  a  grid  can  overlap  with  other  necirby  grids  in 
some  processors.  The  workload  in  these  processors  will  be  twice  as  much  as  that 
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in  the  other  processors.  To  avoid  this  workload  imbalance,  some  grids  have  to  be 
assigned  to  processors  far  from  those  where  they  were  generated.  This  will  make  the 
maximum  length  of  data  path  greater  than  X.  Let  us  assume  that  there  are  m  fine 
grids  (grid  1,  grid  2,  •  •  •,  grid  m  from  left  to  right  in  the  one-dimensional  domain)  at 
level  i  and  the  number  of  grid  points  in  grid  j  is  Mj.  We  also  assume  that  grid  j  is 
assigned  to  P,^ ,  , . . . ,  Pr^ ,  and  Xj  =  rj  -  Ij  +  I.  If  we  allow  overlapping  of  grids 
in  some  processors,  workload  imbalance  among  processors  is  inevitable.  Hence,  we 
avoid  assigning  more  than  one  grid  to  any  processor.  Now,  when  there  are  m  grids 
at  level  t,  the  computation  time  Ti  becomes  as  follows: 

Ti  =  max  (u^,^  +  U,ommf{Xi){Xj  +  aj)  +  Ci] 

where  f{Xj)  =  max(^,  Sp).  aj  is  the  extra  path  length  caused  by  assigning  the  grid 
far  from  the  processors  where  it  was  generated  to  avoid  overlapping.  Here,  each  aj 
depends  on  the  assignments  of  the  other  grids  at  the  same  level,  unlike  the  case  when 
there  is  only  one  grid  in  which  case  it  is  a  constant.  Therefore,  we  cannot  compute 
individual  values  of  Xj  in  the  same  way  as  before.  In  general,  it  is  difficult  to  find 
each  Xj  without  knowing  aj.  Here,  we  propose  a  simple  heuristic  using  an  iterative 
procedure. 

1.  For  each     Xj  =  max(^,  )• 

This  is  the  optimal  number  of  processors  allocated  to  grid  j  if  we  ignore  the 
other  grids  at  the  same  level.  If  J2j  Xj  is  greater  than  N,  each  Xj  is  reduced 
proportionally  according  to  Mj. 

2.  Let  Sj  be  the  index  of  the  processor  where  grid  j  was  generated.  If  grid  j 
was  generated  in  a  set  of  processors,  then  let  Sj  be  the  index  of  the  leftmost 
processor.  Allocate  Pi- ,  P/^+i, . . . ,  Pry  to  grid  j  in  the  following  way. 

ptr  =  0 
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for  j  =  1  to  m 

Ij  =  m&x{ptr,  Sj),    Tj  =  Ij  +  Xj-l,    ptr  =     +  1 
if  (ptr  >  N)  then 

ptr  =  N-l 

for  J  =  m  to  1 

rj  =  m\n{ptr,  rj),    Ij  =  rj  -Xj  +  1,    ptr  =  /j  -  1 

3.  Using  the  above  processor  allocations,  compute  T;  and  Oj  =  Ij  —  Sj.  Also 

find  jmax  such  that  Ti  =  Ucomp^t^  +  Uco^mi^^  +  2)(Xj„.„  +  a,„„)  +  Ci. 

Since  eaich  Xj  was  obtained  assuming  aj  to  be  0,  we  may  be  able  to  reduce  Ti 
by  reducing  aj„„,.  Assume  that  >  0  and  let  A;  =  maX(a,=o)  fc  (/<im..)  ^• 

Reduce  Xfc,  Xt+i, . . .  by  one  and  reallocate  processors  to  each  grid  j 

(k  <j  <  jmax)- 

4.  Recompute  Ti  using  the  above  reallocation  of  processors.  If  it  was  improved 
then  go  to  step  1. 

5.4.2    Two-Dimensional  Problem 

Considering  inter-grid  communication  cost  for  two-dimensional  problem  is  very 
complicated.  When  the  grids  clustered  on  the  computational  domain,  allocation  of 
processors  for  one  grid  cannot  be  determined  independently  since  grids  are  closely 
located  to  each  other.  The  iterative  heuristic  used  for  one-dimensionaJ  problem  pre- 
viously can  be  considered  for  solving  the  two-dimensional  problem.  But,  cifter  we 
find  Xi  X  Yi  submesh  for  each  grid  i  assuming  there  is  only  one  grid  at  a  level,  we 
still  need  to  find  the  locations  of  these  submeshes.  For  the  one-dimensional  prob- 
lem, we  eissigned  m  linecir  subarrays  in  the  same  order  cis  the  grids.  But,  we  need 
to  solve  a  packing  problem  to  arrange  these  submeshes  without  overlapping.  This 
packing  problem  is  different  from  the  one  we  considered  in  the  previous  section  since 
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we  have  to  assign  submeshes  that  are  somewhat  close  to  the  original  locations  of  the 
grids.  This  problem  need  to  be  solved  again  and  again  after  we  reduce  the  sizes  of 
submeshes  by  some  quantities.  For  this  reason,  we  cannot  apply  the  same  heuristic 
to  the  two-dimensionzil  problem  without  major  changes. 

One  possible  solution  for  assigning  submeshes  is  starting  with  a  submesh  for  a  grid 
located  at  the  center,  then  assigning  other  submeshes  around  it  using  a  simple  packing 
algorithm.  But,  after  reducing  the  sizes  of  the  submeshes,  we  have  to  rearrange  the 
submeshes  so  that  maximum  inter-grid  communication  distance  is  reduced.  This  may 
not  be  a  trivial  problem  without  overlapping  submeshes. 

Another  possible  way  of  allocating  submeshes  considering  inter-grid  communica- 
tion cost  is  imposing  a  constraint  on  it  as  we  did  in  the  first  approach.  When  the  grids 
are  clustered,  we  can  compute  the  size  of  a  submesh  which  all  m  grids  will  be  assigned 
to.  Let  D{PSi)  be  the  diameter  of  a  submesh  PSi,  then  maXaes  UcommD{PSi){Ba  + 
Ma)  <  Cmax  where  all  the  other  notations  are  defined  in  chapter  3.  The  size  of  the 
submesh  should  be  small  enough  so  that  maximum  inter-grid  communication  cost 
cannot  exceed  the  limit  no  matter  which  processors  are  allocated  for  a  grid  within 
the  submesh.  Then,  we  can  use  the  two-dimensional  packing  approach  described  in 
the  previous  section  to  allocate  processors  for  each  grid  within  the  submesh.  Since  we 
do  not  use  all  the  processors,  computation  time  will  increase.  When  there  are  more 
thein  one  cluster  of  grids,  we  have  to  find  a  submesh  for  ecich  cluster.  This  problem 
is  the  same  as  finding  a  submesh  for  each  grid,  but  the  number  of  submeshes  we 
need  to  find  is  much  smaller  because  we  need  one  submesh  per  a  cluster  of  grids.  If 
the  number  of  clusters  is  reasonably  small,  it  will  be  easier  than  solving  the  original 
problem. 

There  is  another  alternative  solution  when  the  grids  are  scattered  instead  of  clus- 
tered.  Consider  mesh  processors  as  domain  and  fine  grids  generated  by  them  as 
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workload.  Then  we  partition  the  processors  as  we  decompose  the  domain  using  bi- 
nary decomposition  (see  Figure  5.27).  The  processors  are  partitioned  in  such  a  way 
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Figure  5.27.  Partitioning  of  Processors 

that  the  computational  requirement  of  grids  are  balanced  among  the  different  groups 
of  processors.  We  need  to  repeat  partitioning  the  processors  until  the  maximum  inter- 
grid  communication  cost  cannot  exceed  the  limit  when  processors  are  allocated  within 
the  group.  Then  we  agaiin  use  the  packing  approach  for  each  group  of  processors. 
This  approach  will  work  when  the  fine  grids  are  reasonably  scattered. 

We  considered  a  few  possible  approaches  to  include  inter-grid  communication  cost 
in  our  objective  function.  Each  approach  seems  to  be  suitable  for  a  particular  type 
of  coarse  grid  distribution  cimong  the  processors.  But  deciding  which  approach  to 
apply  after  generating  a  set  of  fine  grids  is  a  challenging  problem.  Thus,  the  inclusion 
of  inter-grid  communication  cost  makes  the  load  balancing  problem  complicated  and 
requires  further  study. 


CHAPTER  6 
CONCLUSIONS  AND  FUTURE  WORK 

In  this  dissertation,  we  discussed  a  dynamic  load  balancing  problem  that  arises 
in  mapping  adaptive  mesh  refinement  computations  onto  multiprocessor  systems.  We 
identified  three  different  approaches  for  dealing  with  this  problem,  and  formulated 
the  load  balancing  problem  in  all  the  three  cases.  In  the  first  approach,  we  treated 
grids  as  indivisible  computational  entities.  We  developed  a  heuristic  algorithm  which 
distributes  grids  among  the  processors  in  such  a  way  that  workload  is  attempted  to 
be  balanced  and  inter-grid  communication  cost  does  not  exceed  an  acceptable  limit 
at  the  same  time.  By  adjusting  this  parameter,  we  can  provide  a  suitable  trade- 
off of  computation  cost  and  communication  cost.  We  also  developed  a  parallel  load 
balancing  algorithm  to  reduce  the  overhead  imposed  by  load  bedancing  eictivities. 
The  parallel  algorithm  shares  the  same  idea  of  balancing  the  workload  with  the 
sequential  algorithm.  But  the  computational  task  is  completely  parallelized  and  no 
computation  is  performed  redundantly.  It  also  eliminates  the  necessity  for  a  single 
processor  to  collect  all  the  necessary  information,  thus  avoiding  a  communication 
traffic  congestion.  Since  a  grid  cannot  be  assigned  to  more  than  one  processor,  the 
number  of  grids  should  be  larger  than  or  equal  to  the  number  of  processors  in  order  to 
avoid  processor  idle  time  for  the  first  approach.  We  performed  extensive  experiments 
to  analyze  the  accuracy  of  solutions  provided  by  our  heuristic  compared  to  some  well 
known  multiprocessor  scheduling  heuristics  such  as  decreasing-fit  and  list  scheduling. 
In  the  experiments,  we  also  showed  that  our  heuristic  algorithm  produced  better 
solutions  than  these  heuristics  especially  when  the  number  of  grids  is  sufficiently 
large. 
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The  second  approach  which  we  call  domain  decomposition  is  more  general  than 
the  first  approach  because  the  number  of  grids  does  not  have  to  be  related  to  the 
number  of  processors  for  efficient  mapping.  But  this  approach  suffers  from  intensive 
communication  cost  especially  when  the  load  balancing  is  done  at  the  highest  level 
of  adaptive  mesh  refinement.  We  identified  that  the  load  balancing  problem  here  is 
similar  to  the  problem  considered  in  [31]  for  load  balancing  with  resource  migration 
in  distributed  systems  if  the  coarse  grids  were  treated  as  resources. 

The  third  approach  is  also  more  general  than  the  first  approach  but  it  avoids 
excessive  communication  cost  of  the  second  approach.  In  this  dissertation,  we  ignored 
inter-grid  communication  cost  in  the  third  approach  to  make  the  problem  somewhat 
easier  to  solve.  The  resulting  problem  is  a  processor  allocation  problem.  We  related 
this  processor  allocation  problem  to  two-dimensional  packing  problem  which  has 
many  practical  applications  such  as  scheduling  of  tasks  with  shared  resources  and 
cutting  stocks.  We  slightly  modified  the  original  two-dimensional  packing  problem  to 
make  it  more  suitable  for  our  processor  allocation  problem.  We  developed  an  efficient 
packing  algorithm  and  showed  that  in  most  of  the  cases,  our  algorithm  performed 
better  than  a  known  packing  algorithm  through  experiments.  We  also  showed  a 
bound  for  the  accuracy  of  the  solutions  provided  by  our  packing  algorithm.  As  was 
done  in  the  first  approach,  we  parallelized  our  algorithm  and  achieved  a  reduction 
in  time  complexity.  This  approach  works  well  when  the  number  of  processors  is 
much  larger  than  the  number  of  grids,  in  contrast  to  the  first  approach.  To  solve 
the  problem  including  inter-grid  communication  cost,  we  proposed  several  different 
approaches.  Some  of  these  approaches  use  the  two-dimensional  packing  as  a  part  of 
the  solution. 

A  logical  extension  of  our  work  in  the  future  is  to  explore  the  load  balancing 
approaches  for  adaptive  mesh  refinement  algorithms  used  in  solving  time  dependent 
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partial  differential  equations.  In  time  dependent  problems,  large  time  steps  are  often 
used  for  coarse  grids  while  small  time  steps  are  used  for  fine  grids.  And  at  every  level 
/  and  for  every  time  step  Ati,  computation  on  all  the  grids  at  level  /  and  above  (finer 
grids)  are  synchronized.  In  addition,  the  grids  can  be  created,  deleted  or  can  move 
spatially  with  time.  We  will  investigate  the  extension  of  our  approaches  for  non-time 
dependent  problems  to  time  dependent  problems. 

Finally,  we  plan  to  implement  our  load  balancing  algorithms  on  parallel  com- 
puters. In  our  algorithms,  we  perform  many  different  parallel  computations  such  as 
sorting,  prefix  computation  and  various  packet  routings.  We  will  implement  these 
operations  so  as  to  minimize  the  communication  and  synchronization  overheads  that 
is  implicit  in  the  description  of  our  algorithms.  In  particular,  for  the  sorting  problem, 
which  is  the  most  time  consuming  operation  in  our  algorithm,  we  will  search  for  a 
parallel  algorithm  for  our  target  architecture  that  is  not  only  efficient  in  terms  of 
asymptotic  time  complexity  and  speed  up,  but  is  simple  enough  to  be  implemented. 
The  load  balancing  algorithms  should  be  implemented  as  library  routines  which  can 
be  incorporated  by  the  user  in  his  or  her  parallel  adaptive  mesh  refinement  algorithms. 
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